Thread: Horrific time for getting 1 record from an index?


From: Jim Nasby
Postgres 9.1.9.

explain analyze select min(insert_time) from cnu_stats.page_hits_raw ;
                                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Result  (cost=0.12..0.13 rows=1 width=0) (actual time=257545.835..257545.836 rows=1 loops=1)
    InitPlan 1 (returns $0)
      ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual time=257545.828..257545.829 rows=1 loops=1)
            ->  Index Scan using page_hits_raw_pkey on page_hits_raw  (cost=0.00..5445004.65 rows=47165480 width=8) (actual time=257545.826..257545.826 rows=1 loops=1)
                  Index Cond: (insert_time IS NOT NULL)
  Total runtime: 257545.881 ms
(6 rows)


I checked and there were no un-granted locks... but I have a hard time believing it actually took 257 seconds to get 2 pages (one index, one heap) back from our SAN.

Am I missing something here?
--
Jim Nasby, Lead Data Architect   (512) 569-9461


Re: Horrific time for getting 1 record from an index?

From: Daniel Farina
On Mon, Nov 11, 2013 at 1:48 PM, Jim Nasby <jnasby@enova.com> wrote:
> Postgres 9.1.9.
>
> explain analyze select min(insert_time) from cnu_stats.page_hits_raw ;
>
> QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Result  (cost=0.12..0.13 rows=1 width=0) (actual
> time=257545.835..257545.836 rows=1 loops=1)
>    InitPlan 1 (returns $0)
>      ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual
> time=257545.828..257545.829 rows=1 loops=1)
>            ->  Index Scan using page_hits_raw_pkey on page_hits_raw
> (cost=0.00..5445004.65 rows=47165480 width=8) (actual
> time=257545.826..257545.826 rows=1 loops=1)
>                  Index Cond: (insert_time IS NOT NULL)
>  Total runtime: 257545.881 ms
> (6 rows)
>
>
> I checked and there were no un-granted locks... but I have a hard time
> believing it actually took 257 seconds to get 2 pages (one index, one heap)
> back from our SAN.

Try adding EXPLAIN (ANALYZE, BUFFERS).  I am wondering if you are
reading through a lot of pages addressing tuples not visible to the
transaction.


Re: Horrific time for getting 1 record from an index?

From: Jim Nasby
On 11/11/13 3:51 PM, Daniel Farina wrote:
> On Mon, Nov 11, 2013 at 1:48 PM, Jim Nasby <jnasby@enova.com> wrote:
>> Postgres 9.1.9.
>>
>> explain analyze select min(insert_time) from cnu_stats.page_hits_raw ;
>> I checked and there were no un-granted locks... but I have a hard time
>> believing it actually took 257 seconds to get 2 pages (one index, one heap)
>> back from our SAN.
>
> Try adding EXPLAIN (ANALYZE, BUFFERS).  I am wondering if you are
> reading through a lot of pages addressing tuples not visible to the
> transaction.

explain (analyze,buffers) select min(insert_time) from cnu_stats.page_hits_raw ;
                                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
  Result  (cost=0.12..0.13 rows=1 width=0) (actual time=119.347..119.347 rows=1 loops=1)
    Buffers: shared hit=1 read=9476
    InitPlan 1 (returns $0)
      ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual time=119.335..119.336 rows=1 loops=1)
            Buffers: shared hit=1 read=9476
            ->  Index Scan using page_hits_raw_pkey on page_hits_raw  (cost=0.00..5445004.65 rows=47165480 width=8) (actual time=119.333..119.333 rows=1 loops=1)
                  Index Cond: (insert_time IS NOT NULL)
                  Buffers: shared hit=1 read=9476
  Total runtime: 119.382 ms
(9 rows)

We do run a regular process to remove older rows... I thought we were vacuuming after that process but maybe not.
--
Jim Nasby, Lead Data Architect   (512) 569-9461


Re: Horrific time for getting 1 record from an index?

From: Daniel Farina
On Mon, Nov 11, 2013 at 1:57 PM, Jim Nasby <jnasby@enova.com> wrote:
> We do run a regular process to remove older rows... I thought we were
> vacuuming after that process but maybe not.

Could be a long query, idle-in-transaction, prepared transaction, or
hot standby feedback gone bad, too.
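A quick way to look for the first three of these is a sketch like the following (column names are the 9.1 ones -- 9.2 renamed procpid/current_query to pid/query; the 10-minute cutoff is arbitrary):

```sql
-- Backends whose transactions have been open a long time
-- (long queries and idle-in-transaction sessions both show up here)
SELECT procpid, xact_start, current_query
FROM pg_stat_activity
WHERE xact_start < now() - interval '10 minutes'
ORDER BY xact_start;

-- Forgotten two-phase transactions also hold back the xmin horizon
SELECT gid, prepared, owner FROM pg_prepared_xacts;
```

Hot standby feedback would have to be checked on the standby side instead.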


Re: Horrific time for getting 1 record from an index?

From: Jeff Janes
On Mon, Nov 11, 2013 at 1:57 PM, Jim Nasby <jnasby@enova.com> wrote:

> explain (analyze,buffers) select min(insert_time) from cnu_stats.page_hits_raw ;
>                                                                            QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Result  (cost=0.12..0.13 rows=1 width=0) (actual time=119.347..119.347 rows=1 loops=1)
>    Buffers: shared hit=1 read=9476
>    InitPlan 1 (returns $0)
>      ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual time=119.335..119.336 rows=1 loops=1)
>            Buffers: shared hit=1 read=9476
>            ->  Index Scan using page_hits_raw_pkey on page_hits_raw  (cost=0.00..5445004.65 rows=47165480 width=8) (actual time=119.333..119.333 rows=1 loops=1)
>                  Index Cond: (insert_time IS NOT NULL)
>                  Buffers: shared hit=1 read=9476
>  Total runtime: 119.382 ms
> (9 rows)
>
> We do run a regular process to remove older rows... I thought we were vacuuming after that process but maybe not.


Btree indexes have special code that kills index tuples when the table tuple is dead-to-all, so only the first such query after the mass deletion becomes vacuum-eligible should be slow, even if a vacuum is not done.  But if there are long-running transactions that prevent the dead rows from going out of scope, nothing can be done until those transactions go away.
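In other words, the expected pattern after a mass delete is roughly this (a sketch using the table from the thread; the DELETE predicate is invented):

```sql
DELETE FROM cnu_stats.page_hits_raw
WHERE insert_time < now() - interval '90 days';

-- ...once no transaction can still see the deleted rows:
SELECT min(insert_time) FROM cnu_stats.page_hits_raw;  -- slow: marks index entries dead
SELECT min(insert_time) FROM cnu_stats.page_hits_raw;  -- fast: killed entries are skipped
```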

Cheers,

Jeff

Re: Horrific time for getting 1 record from an index?

From: Jim Nasby
On 11/11/13 4:57 PM, Jeff Janes wrote:
> On Mon, Nov 11, 2013 at 1:57 PM, Jim Nasby <jnasby@enova.com> wrote:
> Btree indexes have special code that kill index-tuples when the table-tuple is dead-to-all, so only the first such query after the mass deletion becomes vacuum-eligible should be slow, even if a vacuum is not done.  But if there are long running transactions that prevent the dead rows from going out of scope, nothing can be done until those transactions go away.

There is? I didn't know that, can you point me at code?

BTW, I originally had this, even after multiple queries:

            Buffers: shared hit=1 read=9476

Then vacuum:
INFO:  index "page_hits_raw_pkey" now contains 50343572 row versions in 182800 pages
DETAIL:  3466871 index row versions were removed.
44728 index pages have been deleted, 35256 are currently reusable.

Then...

            Buffers: shared hit=1 read=4

So I suspect a vacuum is actually needed...
--
Jim Nasby, Lead Data Architect   (512) 569-9461


Re: Horrific time for getting 1 record from an index?

From: Jeff Janes
On Mon, Nov 11, 2013 at 3:28 PM, Jim Nasby <jnasby@enova.com> wrote:
> On 11/11/13 4:57 PM, Jeff Janes wrote:
>> On Mon, Nov 11, 2013 at 1:57 PM, Jim Nasby <jnasby@enova.com> wrote:
>> Btree indexes have special code that kill index-tuples when the table-tuple is dead-to-all, so only the first such query after the mass deletion becomes vacuum-eligible should be slow, even if a vacuum is not done.  But if there are long running transactions that prevent the dead rows from going out of scope, nothing can be done until those transactions go away.
>
> There is? I didn't know that, can you point me at code?

git grep "kill_prior_tuple"

> BTW, I originally had this, even after multiple queries:
>
>            Buffers: shared hit=1 read=9476
>
> Then vacuum:
> INFO:  index "page_hits_raw_pkey" now contains 50343572 row versions in 182800 pages
> DETAIL:  3466871 index row versions were removed.
> 44728 index pages have been deleted, 35256 are currently reusable.
>
> Then...
>
>            Buffers: shared hit=1 read=4
>
> So I suspect a vacuum is actually needed...

Hmm.  Maybe the kill method doesn't unlink the empty pages from the tree?

Cheers,

Jeff

Re: Horrific time for getting 1 record from an index?

From: Jeff Janes

On Mon, Nov 11, 2013 at 4:30 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Nov 11, 2013 at 3:28 PM, Jim Nasby <jnasby@enova.com> wrote:
>> On 11/11/13 4:57 PM, Jeff Janes wrote:
>>> On Mon, Nov 11, 2013 at 1:57 PM, Jim Nasby <jnasby@enova.com> wrote:
>>> Btree indexes have special code that kill index-tuples when the table-tuple is dead-to-all, so only the first such query after the mass deletion becomes vacuum-eligible should be slow, even if a vacuum is not done.  But if there are long running transactions that prevent the dead rows from going out of scope, nothing can be done until those transactions go away.
>>
>> There is? I didn't know that, can you point me at code?
>
> git grep "kill_prior_tuple"
>
>> BTW, I originally had this, even after multiple queries:
>>
>>            Buffers: shared hit=1 read=9476

What were the timings like?  Upon repeated execution it seems like all the buffers should be loaded and so be "hit", not "read".

>> Then vacuum:
>> INFO:  index "page_hits_raw_pkey" now contains 50343572 row versions in 182800 pages
>> DETAIL:  3466871 index row versions were removed.
>> 44728 index pages have been deleted, 35256 are currently reusable.
>>
>> Then...
>>
>>            Buffers: shared hit=1 read=4
>>
>> So I suspect a vacuum is actually needed...
>
> Hmm.  Maybe the kill method doesn't unlink the empty pages from the tree?

I verified that this is the case--the empty pages remain linked in the tree until a vacuum removes them.  But walking through empty leaf pages is way faster than resolving pages full of pointers to dead-to-all tuples, so the kill code still gives a huge benefit.  But of course nothing will do much good until the transaction horizon advances.
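One way to observe those still-linked deleted pages from SQL (assuming the pgstattuple contrib module is installed; it ships with 9.1) is:

```sql
-- CREATE EXTENSION pgstattuple;   -- contrib module, one-time setup
SELECT tree_level, leaf_pages, empty_pages, deleted_pages, avg_leaf_density
FROM pgstatindex('page_hits_raw_pkey');
```

deleted_pages stay counted against the index until a later vacuum recycles them.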
 
Cheers,

Jeff

Re: Horrific time for getting 1 record from an index?

From: Jim Nasby
On 11/12/13 6:17 PM, Jeff Janes wrote:
>         BTW, I originally had this, even after multiple queries:
>
>                     Buffers: shared hit=1 read=9476
>
>
> What were the timings like?  Upon repeated execution it seems like all the buffers should be loaded and so be "hit", not "read".

Well, the problem here is that this is a heavily hit 1.5TB database with 8GB of shared buffers... so stuff has to work hard to stay in buffer (and I didn't run all this immediately one after the other).

>         Then vacuum:
>         INFO:  index "page_hits_raw_pkey" now contains 50343572 row versions in 182800 pages
>         DETAIL:  3466871 index row versions were removed.
>         44728 index pages have been deleted, 35256 are currently reusable.
>
>         Then...
>
>                     Buffers: shared hit=1 read=4
>
>         So I suspect a vacuum is actually needed...
>
>
>     Hmm.  Maybe the kill method doesn't unlink the empty pages from the tree?
>
>
> I verified that this is the case--the empty pages remain linked in the tree until a vacuum removes them.  But walking through empty leaf pages is way faster than resolving pages full of pointers to dead-to-all tuples, so the kill code still gives a huge benefit.  But of course nothing will do much good until the transaction horizon advances.

Aaaand... that gets to the other problem... our SAN performance is pretty abysmal. It took ~270 seconds to read 80MB of index pages (+ whatever heap) to get to the first live tuple. (This was run close enough to the vacuum that I don't think visibility of these tuples would have changed mid-stream.)
--
Jim Nasby, Lead Data Architect   (512) 569-9461


Re: Horrific time for getting 1 record from an index?

From: Merlin Moncure
On Tue, Nov 12, 2013 at 6:22 PM, Jim Nasby <jnasby@enova.com> wrote:
> On 11/12/13 6:17 PM, Jeff Janes wrote:
>>
>> I verified that this is the case--the empty pages remain linked in the
>> tree until a vacuum removes them.  But walking through empty leaf pages is
>> way faster than resolving pages full of pointers to dead-to-all tuple, so
>> the kill code still gives a huge benefit.  But of course nothing will do
>> much good until the transaction horizon advances.
>
>
> Aaaand... that gets to the other problem... our SAN performance is pretty
> abysmal. It took ~270 seconds to read 80MB of index pages (+ whatever heap)
> to get to the first live tuple. (This was run close enough to the vacuum
> that I don't think visibility of these tuples would have changed
> mid-stream).

That's awful, but 'par for the course' for SANs in my experience.  If
my math is right, that works out to 27ms / page read.  But each index
page read can cause multiple heap page reads depending on how the data
is organized, so I think you are up against the laws of physics.  All
we can do is try to organize the data so that access patterns are less
random and/or invest in modern hardware.
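Spelling that arithmetic out (assuming the default 8 kB block size; only the 80MB and ~270-second figures come from the thread):

```python
# Sanity-check the per-page latency estimate.
# Assumptions: default 8 kB Postgres block size (BLCKSZ); ~80 MB of index
# pages read; heap page reads ignored, so the real per-read latency is lower.
BLOCK_SIZE = 8 * 1024            # bytes per page
index_bytes = 80 * 1024 * 1024   # ~80 MB of index pages scanned
elapsed_s = 270                  # observed wall-clock seconds

pages = index_bytes // BLOCK_SIZE          # 10240 pages
ms_per_page = elapsed_s * 1000 / pages     # ~26-27 ms per page read

print(f"{pages} pages, {ms_per_page:.1f} ms/page")  # prints "10240 pages, 26.4 ms/page"
```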

merlin