Re: Unexpected expensive index scan - Mailing list pgsql-performance

From Jake Nielsen
Subject Re: Unexpected expensive index scan
Date
Msg-id CAP3LSG6UgeTWGY7_NQr6b1WRVdAnKdWqa0S=1mOuVq_hWK5OHw@mail.gmail.com
Whole thread Raw
In response to Re: Unexpected expensive index scan  ("Mike Sofen" <msofen@runbox.com>)
Responses Re: Unexpected expensive index scan
List pgsql-performance


On Tue, Sep 27, 2016 at 5:41 PM, Mike Sofen <msofen@runbox.com> wrote:

From: Jake Nielsen    Sent: Tuesday, September 27, 2016 5:22 PM


the query

SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^

 

On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen <jake.k.nielsen@gmail.com> wrote:

I've got a query that takes a surprisingly long time to run, and I'm having a really rough time trying to figure it out.

 

Before I get started, here are the specifics of the situation:

 

Here is the table that I'm working with (apologies for spammy indices, I've been throwing shit at the wall)

                            Table "public.syncerevent"

    Column    |  Type   |                        Modifiers                         

--------------+---------+----------------------------------------------------------

 id           | bigint  | not null default nextval('syncerevent_id_seq'::regclass)

 userid       | text    | 

 event        | text    | 

 eventid      | text    | 

 originatorid | text    | 

 propogatorid | text    | 

 kwargs       | text    | 

 conflicted   | integer | 

Indexes:

    "syncerevent_pkey" PRIMARY KEY, btree (id)

    "syncereventidindex" UNIQUE, btree (eventid)

    "anothersyncereventidindex" btree (userid)

    "anothersyncereventidindexwithascending" btree (userid, id)

    "asdfasdgasdf" btree (userid, id DESC)

    "syncereventuseridhashindex" hash (userid)

 

To provide some context, as per the wiki, 

there are 3,290,600 rows in this table. 

It gets added to frequently, but never deleted from. 

The "kwargs" column often contains mid-size JSON strings (roughly 30K characters on average)

As of right now, the table has 53 users in it. About 20% of those have a negligible number of events, but the rest of the users have a fairly even smattering.

 

EXPLAIN (ANALYZE, BUFFERS) says:

                                                                          QUERY PLAN                                                                          

--------------------------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual time=3352.390..3403.572 rows=4000 loops=1)  Buffers: shared hit=120244 read=160198

   ->  Index Scan using syncerevent_pkey on syncerevent  (cost=0.43..388147.29 rows=1274560 width=615) (actual time=3352.386..3383.100 rows=4000 loops=1)

         Index Cond: (id > 12468)

         Filter: ((propogatorid <> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text) AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))

         Rows Removed by Filter: 1685801

         Buffers: shared hit=120244 read=160198

 Planning time: 0.833 ms

 Execution time: 3407.633 ms

(9 rows)

If it matters/interests you, here is my underlying confusion:

Any help in making this query more efficient would be greatly appreciated, and any conceptual insights would be extra awesome.

Thanks for reading.

-Jake

----------------------

 

This stands out:  WHERE ID > 12468 AND propogatorId NOT IN ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')

As does this from the analyze:  Rows Removed by Filter: 1685801

 

The propogaterid is practically the only column NOT indexed and it’s used in a “not in”.  It looks like it’s having to do a table scan for all the rows above the id cutoff to see if any meet the filter requirement.  “not in” can be very expensive.  An index might help on this column.  Have you tried that?

 

Your rowcounts aren’t high enough to require partitioning or any other changes to your table that I can see right now.

 

Mike Sofen  (Synthetic Genomics)

 


Thanks Mike, that's true, I hadn't thought of non-indexed columns forcing a scan. Unfortunately, just to test this out, I tried pulling out the more suspect parts of the query, and it still seems to want to do an index scan:


EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM SyncerEvent WHERE userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;

                                                                        QUERY PLAN                                                                        

----------------------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=0.43..1140.62 rows=4000 width=615) (actual time=2706.365..2732.308 rows=4000 loops=1)

   Buffers: shared hit=120239 read=161924

   ->  Index Scan using syncerevent_pkey on syncerevent  (cost=0.43..364982.77 rows=1280431 width=615) (actual time=2706.360..2715.514 rows=4000 loops=1)

         Filter: (userid = '57dc984f1c87461c0967e228'::text)

         Rows Removed by Filter: 1698269

         Buffers: shared hit=120239 read=161924

 Planning time: 0.131 ms

 Execution time: 2748.526 ms

(8 rows)


It definitely looks to me like it's starting at the ID = 12468 row, and just grinding up the rows. The filter is (unsurprisingly) false for most of the rows, so it ends up having to chew up half the table before it actually finds 4000 rows that match.

After creating a partial index using that userId, things go way faster. This is more-or-less what I assumed I'd get by making having that multi-column index of (userId, Id), but alas:

remoteSyncerLogistics=> CREATE INDEX sillyIndex ON syncerevent (ID) where userId = '57dc984f1c87461c0967e228';

CREATE INDEX

remoteSyncerLogistics=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM SyncerEvent WHERe userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;

                                                                  QUERY PLAN                                                                  

----------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=0.43..443.21 rows=4000 width=615) (actual time=0.074..13.349 rows=4000 loops=1)

   Buffers: shared hit=842 read=13

   ->  Index Scan using sillyindex on syncerevent  (cost=0.43..141748.41 rows=1280506 width=615) (actual time=0.071..5.372 rows=4000 loops=1)

         Buffers: shared hit=842 read=13

 Planning time: 0.245 ms

 Execution time: 25.404 ms

(6 rows)


remoteSyncerLogistics=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM SyncerEvent WHERe userId = '57dc984f1c87461c0967e228' AND ID > 12468 ORDER BY ID LIMIT 4000;

                                                                  QUERY PLAN                                                                  

----------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=0.43..453.34 rows=4000 width=615) (actual time=0.023..13.244 rows=4000 loops=1)

   Buffers: shared hit=855

   ->  Index Scan using sillyindex on syncerevent  (cost=0.43..144420.43 rows=1275492 width=615) (actual time=0.020..5.392 rows=4000 loops=1)

         Index Cond: (id > 12468)

         Buffers: shared hit=855

 Planning time: 0.253 ms

 Execution time: 29.371 ms

(7 rows)


Any thoughts?

-Jake

pgsql-performance by date:

Previous
From: "Mike Sofen"
Date:
Subject: Re: Unexpected expensive index scan
Next
From: Jake Nielsen
Date:
Subject: Re: Unexpected expensive index scan