Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Parallel tuplesort (for parallel B-Tree index creation) |
Date | |
Msg-id | CAM3SWZSA_+CBsaYyb22cdqwkfNgD9vQL74NVYnbu62ZqhLD4Og@mail.gmail.com |
In response to | Re: Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>) |
Responses | Re: Parallel tuplesort (for parallel B-Tree index creation); Re: Parallel tuplesort (for parallel B-Tree index creation) |
List | pgsql-hackers |
On Mon, Oct 24, 2016 at 6:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> * Cost model. Should probably attempt to guess final index size, and
>> derive calculation of number of workers from that. Also, I'm concerned
>> that I haven't given enough thought to the low end, where with default
>> settings most CREATE INDEX statements will use at least one parallel
>> worker.
>
> While I haven't made progress on any of these open items, I should
> still get a version out that applies cleanly on top of git tip --
> commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to
> bitrot. I attach V4, which is a fairly mechanical rebase of V3, with
> no notable behavioral changes or bug fixes.

I attach V5. Changes:

* A big cost model overhaul. Workers are now scaled logarithmically based
on projected final *index* size, not current heap size, as was the case in
V4. A new nbtpage.c routine is added to estimate a not-yet-built B-Tree
index's size, and is now called by the optimizer. This involves getting
the average item width of the indexed attributes from pg_attribute for the
heap relation. There are some subtleties here with partial indexes,
null_frac, etc. I also refined the cap on the number of workers, which
prevents too many workers from being launched when there isn't much
maintenance_work_mem to go around.

The cost model is much improved now -- it is more than just a placeholder,
at least. It no longer does things like launch a totally inappropriate
number of workers to build a very small partial index. Granted, those
workers would still have something to do -- scan the heap -- but not
enough to justify launching so many (that is, as many as would be launched
for an equivalent non-partial index). That having been said, things are
still quite fudged here, and I struggle to find any guiding principle for
doing better on average. I think that's because of the inherent difficulty
of modeling what's going on, but I'd be happy to be proven wrong on that.
In any case, I think it's going to be fairly common for DBAs to want to
use the storage parameter to force the use of a particular number of
parallel workers. (See also: my remarks below on how the new
bt_estimate_nblocks() SQL-callable function can give insight into the new
cost model's decisions.)

* Overhauled leader_mergeruns() further, to make it closer to mergeruns().
We now always rewind input tapes. This simplification involved refining
some of the assertions within logtape.c, which is also slightly
simplified.

* 2 new testing tools are added to the final commit in the patch series
(not actually proposed for commit). I've added 2 new SQL-callable
functions to contrib/pageinspect. The 2 new testing functions are:

bt_estimate_nblocks
-------------------

bt_estimate_nblocks() provides an easy way to see the optimizer's
projection of how large the final index will be. It returns an estimate
in blocks.
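For intuition, the projection can be roughly approximated in plain SQL.
The sketch below is illustrative only, not the nbtpage.c logic -- the
16 byte per-tuple overhead, the 90% usable-space figure, and the
mgd.acc_accession/accid names (borrowed from the example that follows)
are all assumptions, and it ignores partial indexes, null_frac, and
multi-column indexes:

    -- rough, assumed approximation of a single-column index's leaf blocks
    select ceil(c.reltuples * (s.avg_width + 16)   -- avg key width + assumed tuple overhead
                / (8192 * 0.90))                   -- assumed usable bytes per leaf page
             as approx_leaf_blocks
    from pg_class c
    join pg_stats s
      on s.schemaname = 'mgd'
     and s.tablename = 'acc_accession'
     and s.attname = 'accid'
    where c.oid = 'mgd.acc_accession'::regclass;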
Example:

mgd=# analyze;
ANALYZE
mgd=# select oid::regclass as rel,
      bt_estimated_nblocks(oid),
      relpages,
      to_char(bt_estimated_nblocks(oid)::numeric / relpages, 'FM990.990')
        as estimate_actual
      from pg_class
      where relkind = 'i'
      order by relpages desc limit 20;

                        rel                          │ bt_estimated_nblocks │ relpages │ estimate_actual
─────────────────────────────────────────────────────┼──────────────────────┼──────────┼─────────────────
 mgd.acc_accession_idx_accid                         │              107,091 │  106,274 │ 1.008
 mgd.acc_accession_0                                 │              169,024 │  106,274 │ 1.590
 mgd.acc_accession_1                                 │              169,024 │   80,382 │ 2.103
 mgd.acc_accession_idx_prefixpart                    │               76,661 │   80,382 │ 0.954
 mgd.acc_accession_idx_mgitype_key                   │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_clustered                     │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_createdby_key                 │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_numericpart                   │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_logicaldb_key                 │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_modifiedby_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_pkey                              │               76,661 │   76,928 │ 0.997
 mgd.mgi_relationship_property_idx_propertyname_key  │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_modifiedby_key    │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_pkey                  │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_clustered         │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_createdby_key     │               74,197 │   74,462 │ 0.996
 mgd.seq_sequence_idx_clustered                      │               50,051 │   50,486 │ 0.991
 mgd.seq_sequence_raw_pkey                           │               35,826 │   35,952 │ 0.996
 mgd.seq_sequence_raw_idx_modifiedby_key             │               35,826 │   35,952 │ 0.996
 mgd.seq_source_assoc_idx_clustered                  │               35,822 │   35,952 │ 0.996
(20 rows)

I haven't tried to make the underlying logic as close to perfect as
possible, but it tends to be accurate in practice, as is evident from this
real-world example (this shows the larger indexes following a restoration
of the mouse genome sample database [1]).

Perhaps there could be a role for a refined bt_estimate_nblocks() function
in determining when B-Tree indexes become bloated/unbalanced (maybe have
pgstatindex() estimate index bloat based on the difference between
projected and actual fan-in?). That has nothing to do with parallel CREATE
INDEX, though.

bt_main_forks_identical
-----------------------

bt_main_forks_identical() checks if 2 specific relations have bitwise
identical main forks. If they do, it returns the number of blocks in the
main fork of each. Otherwise, an error is raised.

Unlike any approach involving *writing* the index in parallel (e.g., any
worthwhile approach based on data partitioning), the proposed parallel
CREATE INDEX implementation creates an index representation identical to
the one created by any serial process (including, for example, the master
branch when CREATE INDEX uses an internal sort). The index that you end up
with when parallelism is used ought to be 100% identical in all cases.
(This is true because there is a TID tie-breaker when sorting B-Tree index
tuples, and because LSNs are set to 0 by CREATE INDEX. Why not exploit
that fact to test the implementation?)

If anyone can demonstrate that parallel CREATE INDEX fails to create an
index representation that is bitwise identical to that of a "known good"
implementation, or can demonstrate that it doesn't consistently produce
exactly the same final index representation given the same underlying
table as input, then there *must* be a bug. bt_main_forks_identical()
gives reviewers an easy way to verify this, perhaps just in passing during
benchmarking.
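To make the intended review workflow concrete, here is a rough sketch of
the kind of check I have in mind. Treat the parallel_workers storage
parameter and the regclass-based bt_main_forks_identical() signature below
as illustrative placeholders rather than the exact interface:

    -- build the same index twice, forcing a serial and then a parallel build
    -- (whatever knob controls the worker count stands in for parallel_workers)
    create index acc_accession_serial on mgd.acc_accession (accid)
      with (parallel_workers = 0);
    create index acc_accession_parallel on mgd.acc_accession (accid)
      with (parallel_workers = 4);

    -- returns the number of blocks in each main fork if they are bitwise
    -- identical, and raises an error otherwise
    select bt_main_forks_identical('mgd.acc_accession_serial'::regclass,
                                   'mgd.acc_accession_parallel'::regclass);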
pg_restore
==========

It occurs to me that parallel CREATE INDEX receives no special
consideration from pg_restore. This leaves it so that the use of parallel
CREATE INDEX can come down to whether or not pg_class.reltuples is
accidentally updated by something like an initial CREATE INDEX. This is
not ideal.

There is also the question of how pg_restore -j cases ought to give
special consideration to parallel CREATE INDEX, if at all -- it's probably
true that concurrent index builds on the same relation do go together well
with parallel CREATE INDEX, but even in V5 pg_restore remains totally
naive about this. That having been said, pg_restore currently does nothing
clever with maintenance_work_mem when multiple jobs are used, even though
that seems at least as useful as what I outline for parallel CREATE INDEX.

It's not clear how to judge this. What do we need to teach pg_restore
about parallel CREATE INDEX, if anything at all? Could this be as simple
as a blanket disabling of parallelism for CREATE INDEX from pg_restore?
Or does it need to be more sophisticated than that? I suppose that tools
like reindexdb and pgbench must be considered in a similar way.

Maybe we could get the number of blocks in the heap relation from the smgr
when its pg_class.reltuples is 0, and then extrapolate reltuples using
simple, generic logic, in the style of vac_estimate_reltuples() (its
"old_rel_pages" == 0 case). For now, I've avoided doing that out of
concern for the overhead in cases where there are many small tables to be
restored, and because it may be better to err on the side of not using
parallelism.

[1] https://wiki.postgresql.org/wiki/Sample_Databases

--
Peter Geoghegan
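P.S. To make the extrapolation idea a bit more concrete, here is a purely
illustrative SQL version of the kind of simple, generic logic I mean. The
page header size and the assumed average tuple width are placeholders, and
the real thing would of course live in the backend rather than in SQL:

    -- extrapolate reltuples for a just-restored table whose pg_class.reltuples
    -- is still 0, using only its physical size and an assumed tuple width
    select c.oid::regclass as rel,
           (pg_relation_size(c.oid) / 8192)        -- heap blocks, as the smgr sees them
           * ((8192 - 24) / 128.0)                 -- assumed tuples per block: usable page
                                                   -- bytes over an assumed ~128 byte tuple
             as extrapolated_reltuples
    from pg_class c
    where c.oid = 'mgd.acc_accession'::regclass;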