Re: Reducing planning time on tables with many indexes - Mailing list pgsql-hackers

From: David Geier
Subject: Re: Reducing planning time on tables with many indexes
Msg-id: 21803d4b-d5e2-58a5-d920-97d285bf571b@gmail.com
In response to: Re: Reducing planning time on tables with many indexes (David Geier <geidav.pg@gmail.com>)
Responses: Re: Reducing planning time on tables with many indexes
List: pgsql-hackers

On 8/1/22 15:33, David Geier wrote:
> Hi Tom,
>
> On Wed, Jul 27, 2022 at 7:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wrote:
>>> Unfortunately, as things stand today, the planner needs more than the
>>> right to look at the indexes' schemas, because it makes physical accesses
>>> to btree indexes to find out their tree height (and I think there are some
>>> comparable behaviors in other AMs).  I've never particularly cared for
>>> that implementation, and would be glad to rip out that behavior if we can
>>> find another way.  Maybe we could persuade VACUUM or ANALYZE to store that
>>> info in the index's pg_index row, or some such, and then the planner
>>> could use it with no lock?
>
> It seems like _bt_getrootheight() first checks if the height is cached, and
> only if it isn't does it access the index meta page.
> If the index locks are only taken for the sake of _bt_getrootheight()
> accessing the meta page when the height is not yet cached, maybe the index
> locks could be taken conditionally.
> However, postponing the call to where it is really needed sounds even better.
>
>> A first step here could just be to postpone fetching _bt_getrootheight()
>> until we actually need it during cost estimation.  That would avoid the
>> need to do it at all for indexes that indxpath.c discards as irrelevant,
>> which is a decision made on considerably more information than the
>> proposed patch uses.

Hi Tom,

I gave the idea of moving _bt_getrootheight() into costsize.c and filling IndexOptInfo in get_relation_info() via syscache instead of relcache a try, but didn't get very far.
Moving out _bt_getrootheight() was straightforward, and we should do that regardless. However, get_relation_info() strongly depends on the index's Relation for quite a lot of its fields. A fair number of fields I could actually fill from the syscache, but some either need data not stored in the syscache (e.g. estimate_rel_size(), Relation::rd_smgr needed by RelationGetNumberOfBlocksInFork()) or need fields that are cached in the index's Relation and would have to be recomputed otherwise (e.g. Relation::rd_indexprs filled by RelationGetIndexExpressions(), Relation::rd_indpred filled by RelationGetIndexPredicate()). Even if we could somehow obtain the missing info from elsewhere, recomputing the otherwise cached fields would likely cause a significant slowdown in the serial case.
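
To illustrate the part that did work: the cheap scalar fields can come straight from a syscache lookup on pg_index. A rough, untested sketch (fill_index_info_from_syscache() is a made-up helper; info is the IndexOptInfo being built in get_relation_info()):

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_index.h"
#include "nodes/pathnodes.h"
#include "utils/syscache.h"

/*
 * Untested sketch (made-up helper): fill the cheap IndexOptInfo fields
 * from a syscache lookup on pg_index, without opening the index Relation.
 */
static void
fill_index_info_from_syscache(Oid indexoid, IndexOptInfo *info)
{
    HeapTuple    ht_idx;
    Form_pg_index index;

    ht_idx = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indexoid));
    if (!HeapTupleIsValid(ht_idx))
        elog(ERROR, "cache lookup failed for index %u", indexoid);
    index = (Form_pg_index) GETSTRUCT(ht_idx);

    info->indexoid = index->indexrelid;
    info->ncolumns = index->indnatts;
    info->nkeycolumns = index->indnkeyatts;
    info->unique = index->indisunique;
    info->immediate = index->indimmediate;

    ReleaseSysCache(ht_idx);
}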

Beyond that, I did some off-CPU profiling to precisely track down which lock serializes execution. It turned out to be the MyProc::fpInfoLock lightweight lock, which guards the fast path of the heavyweight lock. In the contended case, fpInfoLock is acquired in LW_EXCLUSIVE mode (1) during lock acquisition, to check that no other process holds a stronger lock and, if none does, to reserve a process-local fast path lock slot, and (2) at transaction end, to release all fast path lock slots in one go. In both cases, the current implementation linearly iterates over all lock slots. The corresponding call stacks are:

get_relation_info()               CommitTransaction()
  index_open()                      ResourceOwnerRelease()
    relation_open()                   ResourceOwnerReleaseInternal()    
      LockRelationOid()                 ProcReleaseLocks()
        LockAcquireExtended()             LockReleaseAll() <-- called twice from ProcReleaseLocks()
          LWLockAcquire()

On top of that, there are only 16 fast path lock slots. One slot is always taken up by the parent relation, leaving only 15 slots for the indexes. As soon as a session process runs out of slots, it falls back to the normal lock path, which has to go through the shared lock table. To do so it also acquires a lightweight lock in LW_EXCLUSIVE mode. That lightweight lock, however, is partitioned and therefore contends far less. Hence, normal lock acquisition is slower but contends less.
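
For reference, this is the slot search in FastPathGrantRelationLock() (src/backend/storage/lmgr/lock.c), slightly condensed; every fast path acquisition walks all FP_LOCK_SLOTS_PER_BACKEND slots:

/* Condensed from src/backend/storage/lmgr/lock.c */
static bool
FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
    uint32      f;
    uint32      unused_slot = FP_LOCK_SLOTS_PER_BACKEND;

    /* Scan for existing entry for this relid, remembering empty slot. */
    for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
    {
        if (FAST_PATH_GET_BITS(MyProc, f) == 0)
            unused_slot = f;
        else if (MyProc->fpRelId[f] == relid)
        {
            FAST_PATH_SET_LOCKMODE(MyProc, f, lockmode);
            return true;
        }
    }

    /* If no existing entry, use any empty slot. */
    if (unused_slot < FP_LOCK_SLOTS_PER_BACKEND)
    {
        MyProc->fpRelId[unused_slot] = relid;
        FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);
        ++FastPathLocalUseCount;
        return true;
    }

    return false;               /* no existing entry and no empty slot */
}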

To prove the above findings, I increased the number of fast path lock slots per connection and optimized FastPathGrantRelationLock() and FastPathUnGrantRelationLock(). With these changes the lock contention disappeared and the workload scales linearly (the code I tested with also included moving out _bt_getrootheight()):

| Parallel streams | TPS      | TPS / stream  |
|------------------|----------|---------------|
| 1                |   5,253  | 5,253         |
| 10               |  51,406  | 5,140         |
| 20               | 101,401  | 5,070         |
| 30               | 152,023  | 5,067         |
| 40               | 200,607  | 5,015         |
| 50               | 245,359  | 4,907         |
| 60               | 302,994  | 5,049         |

However, with the very same setup, the index filtering approach yields 486k TPS with 60 streams and 9,827 TPS with a single stream. The single stream number shows that this is not because it scales even better, but simply because less work is spent during planning. A quick perf session showed that a fair amount of time is spent getting the relation sizes in blocks (RelationGetNumberOfBlocksInFork() -> lseek64()) and creating index paths (pull_varattnos() -> bms_add_member(), surprisingly).

-   32.20%     1.58%  postgres  postgres            [.] get_relation_info
   - 30.62% get_relation_info
      - 16.56% RelationGetNumberOfBlocksInFork
         - 16.42% smgrnblocks
            - 16.25% mdnblocks
               - 16.10% _mdnblocks
                  + 15.55% __libc_lseek64
      + 5.83% index_open
      + 2.71% estimate_rel_size
        1.56% build_index_tlist
      + 1.22% palloc
   + 1.57% __libc_start_main
-   23.02%     0.03%  postgres  postgres            [.] make_one_rel
   - 22.99% make_one_rel
      - 22.01% set_base_rel_pathlists
         - 21.99% set_rel_pathlist
            - 21.89% set_plain_rel_pathlist
               - 21.53% create_index_paths
                  - 18.76% get_index_paths
                     - 18.33% build_index_paths
                        - 15.77% check_index_only
                           - 14.75% pull_varattnos
                              - 14.58% pull_varattnos_walker
                                 - 13.05% expression_tree_walker
                                    - 9.50% pull_varattnos_walker
                                         5.77% bms_add_member
                                      0.93% bms_add_member
                                      0.52% expression_tree_walker
                                   1.44% pull_varattnos_walker
                        + 1.79% create_index_path
                  + 0.90% match_restriction_clauses_to_index
      + 0.95% set_base_rel_sizes
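
For context, the pull_varattnos() time comes from check_index_only(), which, for every candidate index, collects all attributes used by the query into a bitmapset, one attribute at a time via bms_add_member(). Slightly simplified excerpt from indxpath.c:

/* Slightly simplified from check_index_only() in indxpath.c */
Bitmapset  *attrs_used = NULL;
ListCell   *lc;

/* Add all the attributes needed for joins or final output. */
pull_varattnos((Node *) rel->reltarget->exprs, rel->relid, &attrs_used);

/* Add all the attributes used by restriction clauses. */
foreach(lc, index->indrestrictinfo)
{
    RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

    pull_varattnos((Node *) rinfo->clause, rel->relid, &attrs_used);
}

Notably, the rel->reltarget part does not depend on the index at all, so it could presumably be computed once per relation instead of once per candidate index.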

Given the findings above, the two patches are actually complementary. Optimizing the lock fast path not only helps when many indexes exist and only a small subset is used, but whenever there are many locks used by a query. The index filtering is another way to reduce lock contention, but beyond that also greatly reduces the time spent on planning in the serial case.

I have attached the patch that improves the heavyweight lock fast path. For now it also contains the change to move out _bt_getrootheight(). For workloads where the same set of locks is taken over and over again, it needs on average only a single loop iteration to find the relation (instead of the linear scan before). This allows increasing the number of fast path lock slots by a lot; in this patch I increased them from 16 to 64. The code can be further improved for cases where the set of locked relations changes frequently, where the chance of not finding a relation, and therefore having to linearly search the whole array, is higher.
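
To sketch the idea (hypothetical code, not the exact patch; FastPathFindSlot is a made-up name): probe the slot array starting at a home slot derived from the relation OID, so a relation from a stable working set is found on the first iteration:

/*
 * Hypothetical sketch, not the exact code in the patch: start probing
 * at a home slot derived from the relation OID. A hit typically
 * succeeds on the first iteration; a miss still scans the whole array,
 * because unlocking can leave holes, so the search must not stop early
 * at an empty slot.
 */
#define FP_LOCK_SLOTS_PER_BACKEND 64    /* raised from 16 */

static int
FastPathFindSlot(PGPROC *proc, Oid relid)
{
    uint32      home = (uint32) relid % FP_LOCK_SLOTS_PER_BACKEND;
    uint32      f;

    for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
    {
        uint32      slot = (home + f) % FP_LOCK_SLOTS_PER_BACKEND;

        if (FAST_PATH_GET_BITS(proc, slot) != 0 &&
            proc->fpRelId[slot] == relid)
            return (int) slot;
    }

    return -1;                  /* relid not held via the fast path */
}

The whole-array scan on a miss is exactly the remaining weak spot mentioned above.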

I would really appreciate your feedback, Tom, also on the questions around the approach of filtering out indexes discussed in the previous mails.

--
David Geier
(ServiceNow)
