Thread: Parallel tuplesort (for parallel B-Tree index creation)
As some of you know, I've been working on parallel sort. I think I've gone as long as I can without feedback on the design (and I see that we're accepting stuff for September CF now), so I'd like to share what I came up with. This project is something that I've worked on inconsistently since late last year. It can be thought of as the Postgres 10 follow-up to the 9.6 work on external sorting.

Attached WIP patch series:

* Adds a parallel sorting capability to tuplesort.c.

* Adds a new client of this capability: btbuild()/nbtsort.c can now create B-Trees in parallel.

Most of the complexity here relates to the first item; the tuplesort module has been extended to support sorting in parallel. This is usable in principle by every existing tuplesort caller, without any restriction imposed by the newly expanded tuplesort.h interface. So, for example, randomAccess MinimalTuple support has been added, although it goes unused for now.

I went with CREATE INDEX as the first client of parallel sort in part because the cost model and so on can be relatively straightforward. Even CLUSTER uses the optimizer to determine if a sort strategy is appropriate, and that would need to be taught about parallelism if its tuplesort is to be parallelized. I suppose that I'll probably try to get CLUSTER (with a tuplesort) done in the Postgres 10 development cycle too, but not just yet. For now, I would prefer to focus discussion on tuplesort itself. If you can only look at one part of this patch, please look at the high-level description of the interface/caller contract that was added to tuplesort.h.

Performance
===========

Without further ado, I'll demonstrate how the patch series improves performance in one case. This benchmark was run on an AWS server with many disks. A d2.4xlarge instance was used, with 16 vCPUs, 122 GiB RAM, 12 x 2 TB HDDs, running Amazon Linux. Apparently, this AWS instance type can sustain 1,750 MB/second of I/O, which I was able to verify during testing (when a parallel sequential scan ran, iotop reported read throughput slightly above that for multi-second bursts). Disks were configured in software RAID0. These instances have disks that are optimized for sequential performance, which suits the patch quite well. I don't usually trust AWS EC2 for performance testing, but it seemed to work well here (results were pretty consistent).

Setup:

CREATE TABLE parallel_sort_test AS
    SELECT hashint8(i) randint,
           md5(i::text) collate "C" padding1,
           md5(i::text || '2') collate "C" padding2
    FROM generate_series(0, 1e9::bigint) i;

CHECKPOINT;

This leaves us with a parallel_sort_test table that is 94 GB in size.

SET maintenance_work_mem = '8GB';

-- Serial case (external sort, should closely match master branch):
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH (parallel_workers = 0);

Total time: 00:15:42.15

-- Patch with 8 tuplesort "sort-and-scan" workers (leader process participates as a worker here):
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH (parallel_workers = 7);

Total time: 00:06:03.86

As you can see, the parallel case is 2.58x faster (while using more memory, though it's not the case that a higher maintenance_work_mem setting speeds up the serial/baseline index build). 8 workers are a bit faster than 4, but not by much (not shown). 16 are a bit slower, but not by much (not shown).
trace_sort output for "serial_idx" case:

"""
begin index sort: unique = f, workMem = 8388608, randomAccess = f
switching to external sort with 501 tapes: CPU 7.81s/25.54u sec elapsed 33.95 sec
*** SNIP ***
performsort done (except 7-way final merge): CPU 53.52s/666.89u sec elapsed 731.67 sec
external sort ended, 2443786 disk blocks used: CPU 74.40s/854.52u sec elapsed 942.15 sec
"""

trace_sort output for "patch_8_idx" case:

"""
begin index sort: unique = f, workMem = 8388608, randomAccess = f
*** SNIP ***
sized memtuples 1.62x from worker's 130254158 (3052832 KB) to 210895910 (4942873 KB) for leader merge (0 KB batch memory conserved)
*** SNIP ***
tape -1/7 initially used 411907 KB of 430693 KB batch (0.956) and 26361986 out of 26361987 slots (1.000)
performsort done (except 8-way final merge): CPU 12.28s/101.76u sec elapsed 129.01 sec
parallel external sort ended, 2443805 disk blocks used: CPU 30.08s/318.15u sec elapsed 363.86 sec
"""

This is roughly the degree of improvement that I expected when I first undertook this project late last year. As I go into in more detail below, I believe that we haven't exhausted all avenues to make parallel CREATE INDEX faster still, but I do think that what's left on the table is not enormous.

There is less benefit when sorting on a C locale text attribute, because the overhead of merging dominates parallel sorts, and that's even more pronounced with text. So, many text cases tend to work out at only about 2x - 2.2x faster. We could work on this indirectly. I've seen cases where a CREATE INDEX ended up more than 3x faster, though. I benchmarked this case in the interest of simplicity (the serial case is intended to be comparable, making the test fair). Encouragingly, as you can see from the trace_sort output, the 8 parallel workers are 5.67x faster at getting to the final merge (a merge that is itself performed serially). Note that the final merge for each CREATE INDEX is comparable (7 runs vs. 8 runs from each of 8 workers). Not bad!

Design: New, key concepts for tuplesort.c
=========================================

The heap is scanned in parallel, and worker processes also merge in parallel if required (it isn't required in the example above). The implementation makes heavy use of existing external sort infrastructure. In fact, it's almost the case that the implementation is a generalization of external sorting that allows workers to perform heap scanning and run sorting independently, with tapes then "unified" in the leader process for merging. At that point, the state held by the leader is more or less consistent with the leader being a serial external sort process that has reached its merge phase in the conventional manner (serially).

The steps callers must take are described fully in tuplesort.h. The general idea is that a Tuplesortstate is aware that it might not be a self-contained sort; it may instead be one part of a parallel sort operation. You might say that the tuplesort caller must "build its own sort" from participant worker process Tuplesortstates. The caller creates a dynamic shared memory segment + TOC for each parallel sort operation (there could be more than one concurrent sort operation, of course), passes that to tuplesort to initialize and manage, and creates a "leader" Tuplesortstate in private memory, plus one or more "worker" Tuplesortstates, each presumably managed by a different parallel worker process. tuplesort.c does most of the heavy lifting, including having processes wait on each other to respect its ordering dependencies. The caller is responsible for spawning workers to do the work, reporting details of the workers to tuplesort through shared memory, and having workers call tuplesort to actually perform sorting. The caller consumes the final output through the leader Tuplesortstate in the leader process. I think that this division of labor works well for us.
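To make the contract a little more concrete, here is a heavily simplified pseudo-C sketch of the caller-side flow. To be clear, the identifiers here are illustrative placeholders only, not the patch's actual interface (see tuplesort.h in the patch for that); error handling and the internal wait/ordering protocol, which tuplesort.c manages itself, are omitted:

/*
 * Pseudo-C sketch of the caller/tuplesort division of labor for one
 * parallel sort.  All names here are placeholders, not the real API.
 */

/* Leader, before launching workers: */
seg = dsm_create(...);                     /* DSM segment + TOC, sized per nworkers */
shared = parallel_sort_shared_init(seg, nworkers);  /* handed to tuplesort to manage */
LaunchParallelWorkers(...);                /* caller's job, never tuplesort's */

/* Each worker process: */
wstate = tuplesort_begin_xxx(..., shared, myworkernumber);  /* "worker" Tuplesortstate */
while ((tup = parallel_heap_scan_next(...)) != NULL)
    tuplesort_puttuple(wstate, tup);       /* scan-and-sort, much as in the serial case */
tuplesort_performsort(wstate);             /* sort; leave one materialized run on tape */
tuplesort_end(wstate);

/* Leader, once all launched workers are known to have finished: */
lstate = tuplesort_begin_xxx(..., shared, -1);  /* "leader" Tuplesortstate */
tuplesort_performsort(lstate);             /* unify worker tapes; ready the merge */
while ((tup = tuplesort_gettuple(lstate)) != NULL)
    consume(tup);                          /* e.g. nbtsort.c writes out the index */
tuplesort_end(lstate);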
Tape unification
----------------

Sort operations have a unique identifier, generated before any workers are launched, using a scheme based on the leader's PID and a unique temp file number. This makes all on-disk state (temp files managed by logtape.c) discoverable by the leader process. State in shared memory is sized in proportion to the number of workers, so the only thing about the data being sorted that gets passed around in shared memory is a little logtape.c metadata for tapes, describing for example how large each constituent BufFile is (a BufFile associated with one particular worker's tapeset). (See below also for notes on buffile.c's role in all of this, fd.c and resource management, etc.)

workMem
-------

Each worker process claims workMem as if it were an independent node. The new implementation reuses much of what was originally designed for external sorts. As such, parallel sorts are necessarily external sorts, even when the workMem (i.e. maintenance_work_mem) budget could in principle allow for parallel sorting to take place entirely in memory. The implementation arguably *insists* on making such cases external sorts, when they don't really need to be. This is much less of a problem than you might think, since the 9.6 work on external sorting does somewhat blur the distinction between internal and external sorts (just consider how much time trace_sort indicates is spent waiting on writes in workers; it's typically a small part of the total time spent). Since parallel sort is really only compelling for large sorts, it makes sense to make them external, or at least to prioritize the cases that should be performed externally.

Anyway, workMem-not-exceeded cases require special handling to not completely waste memory. Statistics about worker observations are used at later stages, to at least avoid blatant waste, and to ensure that memory is used optimally more generally.

Merging
=======

The model that I've come up with is that every worker process is guaranteed to output one materialized run onto one tape for the leader to merge, within its "unified" tapeset. This is the case regardless of how much workMem is available, or any other factor. The leader always assumes that the worker runs/tapes are present and discoverable based only on the number of known-launched worker processes, and a little metadata on each that is passed through shared memory.

Producing one output run/materialized tape from all input tuples in a worker often happens without the worker running out of workMem, which you saw above. A straight quicksort and dump of all tuples is therefore possible, without any merging required in the worker. Alternatively, it may prove necessary to do some amount of merging in each worker to generate one materialized output run. This case is handled in the same way as a randomAccess case that requires one materialized output tape to support random access by the caller. This worker merging does necessitate another pass over all temp files for the worker, but that's a much lower cost than you might imagine, in part because the newly expanded use of batch memory makes merging here cache efficient.
Batch allocation is used for all merging involved here, not just the leader's own final on-the-fly merge, so merging is consistently cache efficient. (Workers that must merge on their own are therefore similar to traditional randomAccess callers, so these cases become important enough to optimize with the batch memory patch, although that's still independently useful.)

No merging in parallel
----------------------

Currently, merging worker *output* runs may only occur in the leader process. In other words, we always keep n worker processes busy with scanning-and-sorting (and maybe some merging), but then all processes but the leader process grind to a halt (note that the leader process can participate as a scan-and-sort tuplesort worker, just as it will everywhere else, which is why I specified "parallel_workers = 7" but talked about 8 workers). One leader process is kept busy with merging these n output runs on the fly, so things will bottleneck on that, which you saw in the example above. As already described, workers will sometimes merge in parallel, but only their own runs -- never another worker's runs.

I did attempt to address the leader merge bottleneck by implementing cross-worker run merging in workers. I got as far as implementing a very rough version of this, but initial results were disappointing, and so that was not pursued further than the experimentation stage. Parallel merging is a possible future improvement that could be added to what I've come up with, but I don't think that it will move the needle in a really noticeable way.

Partitioning for parallelism (samplesort style "bucketing")
-----------------------------------------------------------

Perhaps a partition-based approach would be more effective than parallel merging (e.g., redistribute slices of worker runs across workers along predetermined partition boundaries, sort a range of values within dedicated workers, then concatenate to get the final result, a bit like the in-memory samplesort algorithm). That approach would not suit CREATE INDEX, because the approach's great strength is that the workers can run in parallel for the entire duration, since there is no merge bottleneck (this assumes good partition boundaries, which is a bit of a risky assumption). Parallel CREATE INDEX wants something where the workers can independently write the index, and independently WAL log, and independently create a unified set of internal pages, all of which is hard.

This patch series will tend to proportionally speed up CREATE INDEX statements at a level that is comparable to other major database systems. That's enough progress for one release. I think that partitioning to sort is more useful for query execution than for utility statements like CREATE INDEX.

Partitioning and merge joins
----------------------------

Robert has often speculated about what it would take to make merge joins work well in parallel. I think that "range distribution"/bucketing will prove an important component of that. It's just too useful to aggregate tuples in shared memory initially, and have workers sort them without any serial merge bottleneck; arguments about misestimations, data skew, and so on should not deter us from this, long term. This approach has minimal IPC overhead, especially with regard to LWLock contention. This kind of redistribution probably belongs in a Gather-like node, though, which has access to the context necessary to determine a range, and even dynamically alter the range in the event of a misestimation.
Under this scheme, tuplesort.c just needs to be instructed that these worker-private Tuplesortstates are range-partitioned (i.e., the sorts are virtually independent, as far as it's concerned). That's a bit messy, but it is still probably the way to go for merge joins and other sort-reliant executor nodes.

buffile.c, and "unification"
============================

There has been significant new infrastructure added to make logtape.c aware of workers. buffile.c has in turn been taught about unification as a first class part of the abstraction, with low-level management of certain details occurring within fd.c. So, "tape unification", where processes open other backends' logical tapes to generate a unified logical tapeset for the leader to merge, is added. This is probably the single biggest source of complexity for the patch, since I must consider:

* Creating a general, reusable abstraction for other possible BufFile users (logtape.c only has to serve tuplesort.c, though).

* Logical tape free space management.

* Resource management, file lifetime, etc. fd.c resource management can now close a file at xact end for temp files, while not deleting it in the leader backend (only the "owning" worker backend deletes the temp file it owns).

* Crash safety (e.g., when to truncate existing temp files, and when not to).

CREATE INDEX user interface
===========================

There are two ways of determining how many parallel workers a CREATE INDEX requests:

* A cost model, which is closely based on create_plain_partial_paths() at the moment. This needs more work, particularly to model things like maintenance_work_mem. Even still, it isn't terrible.

* A parallel_workers storage parameter, which completely bypasses the cost model. This is the "DBA knows best" approach, and is what I've consistently used during testing. Corey Huinker has privately assisted me with performance testing the patch, using his own datasets. Testing has exclusively used the storage parameter.

I've added a new GUC, max_parallel_workers_maintenance, which is essentially the utility statement equivalent of max_parallel_workers_per_gather. This is clearly necessary, since we're using up to maintenance_work_mem per worker, which is of course typically much higher than work_mem. I didn't feel the need to create a new maintenance-wise variant GUC for things like min_parallel_relation_size, though. Only this one new GUC is added (plus the new storage parameter, parallel_workers, not to be confused with the existing table storage parameter of the same name).

I am much more concerned about the tuplesort.h interface than the CREATE INDEX user interface as such. The user interface is merely a facade on top of tuplesort.c and nbtsort.c (and not one that I'm particularly attached to).

--
Peter Geoghegan
On Mon, Aug 1, 2016 at 6:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> As some of you know, I've been working on parallel sort. I think I've
> gone as long as I can without feedback on the design (and I see that
> we're accepting stuff for September CF now), so I'd like to share
> what I came up with. This project is something that I've worked on
> inconsistently since late last year. It can be thought of as the
> Postgres 10 follow-up to the 9.6 work on external sorting.

I am glad that you are working on this. Just a first thought after reading the email:

> As you can see, the parallel case is 2.58x faster (while using more
> memory, though it's not the case that a higher maintenance_work_mem
> setting speeds up the serial/baseline index build). 8 workers are a
> bit faster than 4, but not by much (not shown). 16 are a bit slower,
> but not by much (not shown).
...
> I've seen cases where a CREATE INDEX ended up more than 3x faster,
> though. I benchmarked this case in the interest of simplicity (the
> serial case is intended to be comparable, making the test fair).
> Encouragingly, as you can see from the trace_sort output, the 8
> parallel workers are 5.67x faster at getting to the final merge (a
> merge that is itself performed serially). Note that the final merge
> for each CREATE INDEX is comparable (7 runs vs. 8 runs from each of
> 8 workers). Not bad!

I'm not going to say it's bad to be able to do things 2-2.5x faster, but linear scalability this ain't - particularly because your 2.58x faster case is using up to 7 or 8 times as much memory. The single-process case would be faster in that case, too: you could quicksort. I feel like for sorting, in particular, we probably ought to be setting the total memory budget, not the per-process memory budget. Or if not, then any CREATE INDEX benchmarking had better compare using scaled values for maintenance_work_mem; otherwise, you're measuring the impact of using more memory as much as anything else.

I also think that Amdahl's law is going to pinch pretty severely here. If the final merge phase is a significant percentage of the total runtime, picking an algorithm that can't parallelize the final merge is going to limit the speedups to small multiples. That's an OK place to be as a result of not having done all the work yet, but you don't want to get locked into it. If we're going to have a substantial portion of the work that can never be parallelized, maybe we've picked the wrong algorithm.

The work on making the logtape infrastructure parallel-aware seems very interesting and potentially useful for other things. Sadly, I don't have time to look at it right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not going to say it's bad to be able to do things 2-2.5x faster,
> but linear scalability this ain't - particularly because your 2.58x
> faster case is using up to 7 or 8 times as much memory. The
> single-process case would be faster in that case, too: you could
> quicksort.

Certainly, there are cases where a parallel version could benefit from having more memory more so than from actually parallelizing the underlying task. However, this case was pointedly chosen to *not* be such a case. When maintenance_work_mem exceeds about 5GB, I've observed that since 9.6 increasing it is just as likely to hurt as to help by about +/-5% (unless and until it's all in memory, which still doesn't help much). In general, there isn't all that much point in doing a very large sort like this in memory. You just don't get that much of a benefit for the memory you use, because linearithmic CPU costs eventually really dominate linear sequential I/O costs.

I think you're focusing on the fact that there is a large absolute disparity in memory used in this one benchmark, but that isn't something that the gains shown particularly hinge upon. There isn't that much difference when workers must merge their own runs, for example. It saves the serial leader merge some work, and in particular makes it more cache efficient (by having fewer runs/tapes). Finally, while about 8x as much memory is used, the memory used over and above the serial case is almost all freed when the final merge begins (the final merges are therefore very similar in both cases, including in terms of memory use). So, for as long as you use 8x as much memory for 8 active processes, you get a 5.67x speed-up of that part alone. You still keep a few extra KiBs of memory for worker tapes and things like that during the leader's merge, but that's a close to negligible amount.

> I feel like for sorting, in particular, we probably ought
> to be setting the total memory budget, not the per-process memory
> budget. Or if not, then any CREATE INDEX benchmarking had better
> compare using scaled values for maintenance_work_mem; otherwise,
> you're measuring the impact of using more memory as much as anything
> else.

As I said, the benchmark was chosen to avoid that (and to be simple and reproducible). I am currently neutral on the question of whether or not maintenance_work_mem should be doled out per process or per sort operation. I do think that making it a per-process allowance is far closer to what we do for hash joins today, and is simpler. What's nice about the idea of making the workMem/maintenance_work_mem budget per sort is that it leaves the leader process with license to greatly increase the amount of memory it can use for the merge. Increasing the amount of memory used for the merge will improve things for longer than it will for workers. I've simulated it already.

> I also think that Amdahl's law is going to pinch pretty severely here.

Doesn't that almost always happen, though? Isn't that what you generally see with queries that show off the parallel join capability?

> If the final merge phase is a significant percentage of the total
> runtime, picking an algorithm that can't parallelize the final merge
> is going to limit the speedups to small multiples. That's an OK place
> to be as a result of not having done all the work yet, but you don't
> want to get locked into it.
> If we're going to have a substantial portion of the work that can
> never be parallelized, maybe we've picked the wrong algorithm.

I suggest that this work be compared to something with similar constraints. I used Google to try to get some indication of how much of a difference parallel CREATE INDEX makes in other major database systems. This is all I could find:

https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

It seems like the degree of parallelism used for SQL Server tends to affect index build time in a way that is strikingly similar to what I've come up with (which may be a coincidence; I don't know anything about SQL Server). So, I suspect that the performance of this is fairly good in an apples-to-apples comparison.

Parallelizing merging can hurt or help, because there is a cost in memory bandwidth (if not I/O) for the extra passes that are used to keep more CPUs busy, which is kind of analogous to the situation with polyphase merge. I'm not saying that we shouldn't do that even still, but I believe that there are sharply diminishing returns. Tuple comparisons during merging are much more expensive than quicksort tuple comparisons, which tend to benefit from abbreviated keys a lot.

As I've said, there is probably a good argument to be made for partitioning to increase parallelism. But, that involves risks around the partitioning being driven by statistics or a cost model, and I don't think you'd be too on board with the idea of every CREATE INDEX after bulk loading needing an ANALYZE first. I tend to think of that as more of a parallel query thing, because you can often push down a lot more there, dynamic sampling might be possible, and there isn't a need to push all the tuples through one point in the end. Nothing I've done here precludes your idea of a sort-order-preserving gather node. I think that we may well need both. Since merging is a big bottleneck with this, we should probably also work to address that indirectly.

> The work on making the logtape infrastructure parallel-aware seems
> very interesting and potentially useful for other things. Sadly, I
> don't have time to look at it right now.

I would be happy to look at generalizing that further, to help parallel hash join. As you know, Thomas Munro and I have discussed this privately.

--
Peter Geoghegan
On Wed, Aug 3, 2016 at 5:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not going to say it's bad to be able to do things 2-2.5x faster,
>> but linear scalability this ain't - particularly because your 2.58x
>> faster case is using up to 7 or 8 times as much memory. The
>> single-process case would be faster in that case, too: you could
>> quicksort.
>
> [ lengthy counter-argument ]

None of this convinces me that testing this in a way that is not "apples to apples" is a good idea, nor will any other argument.

>> I also think that Amdahl's law is going to pinch pretty severely here.
>
> Doesn't that almost always happen, though?

To some extent, sure, absolutely. But it's our job as developers to try to foresee and minimize those cases. When Noah was at EnterpriseDB a few years ago and we were talking about parallel internal sort, Noah started by doing a survey of the literature and identified parallel quicksort as the algorithm that seemed best for our use case. Of course, every time quicksort partitions the input, you get two smaller sorting problems, so it's easy to see how to use 2 CPUs after the initial partitioning step has been completed and 4 CPUs after each of those partitions has been partitioned again, and so on. However, that turns out not to be good enough, because the first partitioning step can consume a significant percentage of the total runtime - so if you only start parallelizing after that, you're leaving too much on the table. To avoid that, the algorithm he was looking at had a (complicated) way of parallelizing the first partitioning step; then you can, it seems, do the full sort in parallel.

There are some somewhat outdated and perhaps naive ideas about this that we wrote up here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Anyway, you're proposing an algorithm that can't be fully parallelized. Maybe that's OK. But I'm a little worried about it. I'd feel more confident if we knew that the merge could be done in parallel and were just leaving that to a later development stage; or if we picked an algorithm like the one above that doesn't leave a major chunk of the work unparallelizable.

> Isn't that what you
> generally see with queries that show off the parallel join capability?

For nested loop joins, no. The whole join operation can be done in parallel. For hash joins, yes: building the hash table once per worker can run afoul of Amdahl's law in a big way. That's why Thomas Munro is working on fixing it:

https://wiki.postgresql.org/wiki/EnterpriseDB_database_server_roadmap

Obviously, parallel query is subject to a long list of annoying restrictions at this point. On queries that don't hit any of those restrictions we can get 4-5x speedup with a leader and 4 workers. As we expand the range of plan types that we can construct, I think we'll see those kinds of speedups for a broader range of queries. (The question of exactly why we top out with as few workers as currently seems to be the case needs more investigation, too; maybe contention effects?)

>> If the final merge phase is a significant percentage of the total
>> runtime, picking an algorithm that can't parallelize the final merge
>> is going to limit the speedups to small multiples. That's an OK place
>> to be as a result of not having done all the work yet, but you don't
>> want to get locked into it.
>> If we're going to have a substantial portion of the work that can
>> never be parallelized, maybe we've picked the wrong algorithm.
>
> I suggest that this work be compared to something with similar
> constraints. I used Google to try to get some indication of how much
> of a difference parallel CREATE INDEX makes in other major database
> systems. This is all I could find:
>
> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

I do agree that it is important not to have unrealistic expectations.

> As I've said, there is probably a good argument to be made for
> partitioning to increase parallelism. But, that involves risks around
> the partitioning being driven by statistics or a cost model, and I
> don't think you'd be too on board with the idea of every CREATE INDEX
> after bulk loading needing an ANALYZE first. I tend to think of that
> as more of a parallel query thing, because you can often push down a
> lot more there, dynamic sampling might be possible, and there isn't a
> need to push all the tuples through one point in the end. Nothing
> I've done here precludes your idea of a sort-order-preserving gather
> node. I think that we may well need both.

Yes. Rushabh is working on that, and Finalize GroupAggregate -> Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking pretty sweet.

>> The work on making the logtape infrastructure parallel-aware seems
>> very interesting and potentially useful for other things. Sadly, I
>> don't have time to look at it right now.
>
> I would be happy to look at generalizing that further, to help
> parallel hash join. As you know, Thomas Munro and I have discussed
> this privately.

Right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> To some extent, sure, absolutely. But it's our job as developers to
> try to foresee and minimize those cases. When Noah was at
> EnterpriseDB a few years ago and we were talking about parallel
> internal sort, Noah started by doing a survey of the literature and
> identified parallel quicksort as the algorithm that seemed best for
> our use case. Of course, every time quicksort partitions the input,
> you get two smaller sorting problems, so it's easy to see how to use
> 2 CPUs after the initial partitioning step has been completed and 4
> CPUs after each of those partitions has been partitioned again, and
> so on. However, that turns out not to be good enough, because the
> first partitioning step can consume a significant percentage of the
> total runtime - so if you only start parallelizing after that, you're
> leaving too much on the table. To avoid that, the algorithm he was
> looking at had a (complicated) way of parallelizing the first
> partitioning step; then you can, it seems, do the full sort in
> parallel.
>
> There are some somewhat outdated and perhaps naive ideas about this
> that we wrote up here:
>
> https://wiki.postgresql.org/wiki/Parallel_Sort

I'm familiar with that effort. I think that when researching topics like sorting, it can sometimes be a mistake to not look at an approach specifically recommended by the database research community. A lot of the techniques we've benefited from within tuplesort.c have been a matter of addressing memory latency as a bottleneck; techniques that are fairly simple and not worth writing a general interest paper on. Also, things like abbreviated keys are beneficial in large part because people tend to follow the first normal form, and therefore an abbreviated key can contain a fair amount of entropy most of the time. Similarly, radix sort seems really cool, but our requirements around generality seem to make it impractical.

> Anyway, you're proposing an algorithm that can't be fully
> parallelized. Maybe that's OK. But I'm a little worried about it.
> I'd feel more confident if we knew that the merge could be done in
> parallel and were just leaving that to a later development stage; or
> if we picked an algorithm like the one above that doesn't leave a
> major chunk of the work unparallelizable.

I might be able to resurrect the parallel merge stuff, just to guide reviewer intuition on how much that can help or hurt. I can probably repurpose it to show you the mixed picture on how effective it is. I think it might help more with collatable text that doesn't have abbreviated keys, for example, because you can use more of the machine's memory bandwidth for longer. But for integers, it can hurt. (That's my recollection; I prototyped parallel merge a couple of months ago now.)

>> Isn't that what you
>> generally see with queries that show off the parallel join capability?
>
> For nested loop joins, no. The whole join operation can be done in
> parallel.

Sure, I know, but I'm suggesting that laws-of-physics problems may still be more significant than implementation deficiencies, even though those deficiencies need to be stamped out. Linear scalability is really quite rare for most database workloads.

> Obviously, parallel query is subject to a long list of annoying
> restrictions at this point. On queries that don't hit any of those
> restrictions we can get 4-5x speedup with a leader and 4 workers.
> As we expand the range of plan types that we can construct, I think
> we'll see those kinds of speedups for a broader range of queries.
> (The question of exactly why we top out with as few workers as
> currently seems to be the case needs more investigation, too; maybe
> contention effects?)

You're probably bottlenecked on memory bandwidth. Note that I showed improvements with 8 workers, not 4. 4 workers are slower than 8, but not by that much.

>> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/
>
> I do agree that it is important not to have unrealistic expectations.

Great. My ambition for this patch is that it put parallel CREATE INDEX on a competitive footing against the implementations featured in other major systems. I don't think we need to do everything at once, but I have no intention of pushing forward with something that doesn't do respectably there. I also want to avoid partitioning in the first version of this, and probably in any version that backs CREATE INDEX. I've only made minimal changes to the tuplesort.h interface here to support parallelism. That flexibility counts for a lot, IMV.

>> As I've said, there is probably a good argument to be made for
>> partitioning to increase parallelism. But, that involves risks around
>> the partitioning being driven by statistics or a cost model
>
> Yes. Rushabh is working on that, and Finalize GroupAggregate ->
> Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking
> pretty sweet.

A "Gather Merge" node doesn't really sound like what I'm talking about. Isn't that something to do with table-level partitioning? I'm talking about dynamic partitioning, typically of a single table, of course.

>>> The work on making the logtape infrastructure parallel-aware seems
>>> very interesting and potentially useful for other things. Sadly, I
>>> don't have time to look at it right now.
>>
>> I would be happy to look at generalizing that further, to help
>> parallel hash join. As you know, Thomas Munro and I have discussed
>> this privately.
>
> Right.

By the way, the patch is in better shape from that perspective, as compared to the early version Thomas (CC'd) had access to. The BufFile stuff is now credible as a general-purpose abstraction.

--
Peter Geoghegan
On Sat, Aug 6, 2016 at 2:16 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> There are some somewhat outdated and perhaps naive ideas about this
>> that we wrote up here:
>>
>> https://wiki.postgresql.org/wiki/Parallel_Sort
>
> I'm familiar with that effort. I think that when researching topics
> like sorting, it can sometimes be a mistake to not look at an approach
> specifically recommended by the database research community. A lot of
> the techniques we've benefited from within tuplesort.c have been a
> matter of addressing memory latency as a bottleneck; techniques that
> are fairly simple and not worth writing a general interest paper on.
> Also, things like abbreviated keys are beneficial in large part
> because people tend to follow the first normal form, and therefore an
> abbreviated key can contain a fair amount of entropy most of the time.
> Similarly, radix sort seems really cool, but our requirements around
> generality seem to make it impractical.
>
>> Anyway, you're proposing an algorithm that can't be fully
>> parallelized. Maybe that's OK. But I'm a little worried about it.
>> I'd feel more confident if we knew that the merge could be done in
>> parallel and were just leaving that to a later development stage; or
>> if we picked an algorithm like the one above that doesn't leave a
>> major chunk of the work unparallelizable.
>
> I might be able to resurrect the parallel merge stuff, just to guide
> reviewer intuition on how much that can help or hurt.

I think some of the factors here, like how many workers will be used for the merge phase, might impact performance. Having too many workers can lead to more communication cost, and having too few workers might not yield the best results for the merge. One thing I have noticed is that, in general for sorting, some of the other databases use range partitioning [1]; now, that might not be what is good for us. I see you mentioned above why it is not good [2], but I don't understand why you think it is a risky assumption to assume good partition boundaries for parallelizing sort.

[1] - https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
Refer to the Producer or Consumer Operations section.
[2] - "That approach would not suit CREATE INDEX, because the approach's great strength is that the workers can run in parallel for the entire duration, since there is no merge bottleneck (this assumes good partition boundaries, which is a bit of a risky assumption)"

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 6, 2016 at 6:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think some of the factors here, like how many workers will be used
> for the merge phase, might impact performance. Having too many
> workers can lead to more communication cost, and having too few
> workers might not yield the best results for the merge. One thing I
> have noticed is that, in general for sorting, some of the other
> databases use range partitioning [1]; now, that might not be what is
> good for us.

I don't disagree with anything you say here. I acknowledged that partitioning will probably be important for sorting in my introductory e-mail, after all.

> I see you mentioned above why it is not good [2], but I don't
> understand why you think it is a risky assumption to assume good
> partition boundaries for parallelizing sort.

Well, apparently there are numerous problems with partitioning in systems like SQL Server and Oracle in the worst case. For one thing, in the event of a misestimation (or failure of the dynamic sampling that I presume can sometimes be used), workers can be completely starved of work for the entire duration of the sort. And for CREATE INDEX to get much of any benefit, all workers must write their part of the index independently, too. This can affect the physical structure of the final index. SQL Server also has a caveat in its documentation about this resulting in an unbalanced final index, which I imagine could be quite bad in the worst case.

I believe that it's going to be hard to get any version of this that writes the index simultaneously in each worker accepted, for these reasons. This patch I came up with isn't very different from the serial case at all. Any index built in parallel by the patch ought to have relfilenode files on the filesystem that are 100% identical to those produced by the serial case, in fact (since CREATE INDEX does not set LSNs in the new index pages). I've actually developed a simple way of "fingerprinting" indexes during testing of this patch, knowing that hashing the files on disk ought to produce a perfect match compared to a master branch serial sort case.

At the same time, any information that I've seen about how much parallel CREATE INDEX speeds things up in these other systems indicates that the benefits are very similar. It tends to be in the 2x - 3x range, with the same reduction in throughput seen at about 16 workers, after we peak at about 8 workers. So, I think that the benefits of partitioning are not really seen with CREATE INDEX (I think of partitioning as more of a parallel query thing). Obviously, any benefit that might still exist for CREATE INDEX in particular, when weighed against the costs, makes partitioning look pretty unattractive as a next step.

I think that during the merge phase of parallel CREATE INDEX as implemented, the system generally still isn't that far from being I/O bound. Whereas, with parallel query, partitioning makes each worker able to return one tuple from its own separate range very quickly -- not just one process. (Presumably, each worker merges non-overlapping "ranges" from runs initially sorted in each worker; each worker subsequently merges after a partition-wise redistribution of the initial fully sorted runs, allowing for dynamic sampling to optimize the actual range used for load balancing.) The workers can then do more CPU-bound processing in whatever node is fed by each worker's ranged merge; everything is kept busy. That's the approach that I personally had in mind for partitioning, at least.
It's really nice for parallel query to be able to totally separate workers after the point of redistribution. CREATE INDEX is not far from being I/O bound anyway, though, so it benefits far less. (Consider how fast the merge phase still is at writing out the index in *absolute* terms.)

Look at figure 9 in this paper:

http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf

Even in good cases for "independent sorting", there is only a benefit seen at 8 cores. At the same time, I can only get about 6x scaling with 8 workers, just for the initial generation of runs.

All of these factors are why I believe I'm able to compete well with other systems with this relatively straightforward, evolutionary approach. I have a completely open mind about partitioning, but my approach makes sense in this context.

--
Peter Geoghegan
On Wed, Aug 3, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Since merging is a big bottleneck with this, we should probably also
> work to address that indirectly.

I attach a patch that changes how we maintain the heap invariant during tuplesort merging. I already mentioned this over on the "Parallel tuplesort, partitioning, merging, and the future" thread. As noted already on that thread, this patch makes merging clustered numeric input about 2.1x faster overall in one case, which is particularly useful in the context of a serial final/leader merge during a parallel CREATE INDEX. Even *random* non-C-collated text input is made significantly faster. This work is totally orthogonal to parallelism, though; it's just very timely, given our discussion of the merge bottleneck on this thread.

If I benchmark a parallel build of a 100 million row index, with presorted input, I can see a 71% reduction in *comparisons* with 8 tapes/workers, and an 80% reduction in comparisons with 16 workers/tapes in one instance (the numeric case I just mentioned). With random input, we can still come out significantly ahead, but not to the same degree. I was able to see a reduction in comparisons during a leader merge, from 1,468,522,397 comparisons to 999,755,569 comparisons, which is obviously still quite significant (worker merges, if any, benefit too). I think I need to redo my parallel CREATE INDEX benchmark, so that you can take this into account. Also, I think that this patch will make very large external sorts that naturally have tens of runs to merge significantly faster, but I didn't bother to benchmark that.

The patch is intended to be applied on top of parallel B-Tree patches 0001-* and 0002-* [1]. I happened to test it with parallelism, but these are all independently useful, and will be entered as a separate CF entry (perhaps better to commit the earlier two patches first, to avoid merge conflicts). I'm optimistic that we can get those 3 patches in the series out of the way early, without blocking on discussing parallel sort.

The patch makes tuplesort merging shift down and displace the root tuple with the tape's next preread tuple, rather than compacting and then inserting into the heap anew. This approach to maintaining the heap as tuples are returned to the caller will always produce fewer comparisons overall. The new approach is also simpler. We were already shifting down to compact the heap within the misleadingly named [2] function tuplesort_heap_siftup() -- why not instead just use the caller tuple (the tuple that we currently go on to insert) when initially shifting down (not the heap's preexisting last tuple, which is guaranteed to go straight to the leaf level again)? That way, we don't need to enlarge the heap at all through insertion, shifting up, etc. We're done, and are *guaranteed* to have performed less work (fewer comparisons and swaps) than with the existing approach (this is the reason for my optimism about getting this stuff out of the way early).

This new approach is more or less the *conventional* way to maintain the heap invariant when returning elements from a heap during k-way merging. Our existing approach is convoluted; merging was presumably only coded that way because the generic functions tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be available. Perhaps the problem was masked by unrelated bottlenecks that existed at the time, too.
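To illustrate the shape of the thing (and only that -- this is a toy standalone version using a plain int min-heap, not the patch's SortTuple representation or its actual code), root displacement is just the textbook "replace the root, then sift down" routine:

#include <stdio.h>

/*
 * Toy analog of the root displace operation: heap[0..n-1] is a valid
 * min-heap on entry.  The caller has just consumed the root (the
 * merge's next output tuple); newval is the displacing value (the
 * tape's next preread tuple, in the real patch).  Sift newval down in
 * a single top-down pass: at most two comparisons per level, and no
 * separate compact-then-insert.
 */
static void
heap_root_displace(int *heap, int n, int newval)
{
    int     i = 0;

    for (;;)
    {
        int     left = 2 * i + 1;
        int     right = left + 1;
        int     imin = left;

        if (left >= n)
            break;              /* reached a leaf; newval lands at i */
        if (right < n && heap[right] < heap[left])
            imin = right;       /* comparison 1: pick the smaller child */
        if (newval <= heap[imin])
            break;              /* comparison 2: invariant restored */
        heap[i] = heap[imin];   /* promote smaller child into the hole */
        i = imin;
    }
    heap[i] = newval;
}

int
main(void)
{
    int     heap[] = {1, 3, 2, 7, 4, 5, 6};     /* valid min-heap */
    int     i;

    /* Root value 1 was just returned to the caller; 9 displaces it */
    heap_root_displace(heap, 7, 9);
    for (i = 0; i < 7; i++)
        printf("%d ", heap[i]);                 /* prints: 2 3 5 7 4 9 6 */
    printf("\n");
    return 0;
}

The old approach (compact the heap by sifting the last element down from the root, then insert the caller tuple by shifting it up from a new leaf) does strictly more comparisons and swaps to arrive at an equivalent state.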
I think that I could push this further (a minimum of 2 comparisons per item returned when 3 or more tapes are active still seems like 1 comparison too many), but what I have here gets us most of the benefit. And, it does so while not actually adding code that could be called "overly clever", IMV. I'll probably leave clever, aggressive optimization of merging for a later release.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAM3SWZQ+2gJMNV7ChxwEXqXopLfb_FEW2RfEXHJ+GsYF39f6MQ@mail.gmail.com

--
Peter Geoghegan
On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached WIP patch series:

This has bitrot, since commit da1c9163 changed the interface for checking parallel safety. I'll have to fix that, and will probably take the opportunity to change how workers have maintenance_work_mem apportioned while I'm at it. To recap, it would probably be better if maintenance_work_mem remained a high watermark for the entire CREATE INDEX, rather than applying as a per-worker allowance.

--
Peter Geoghegan
On 08/16/2016 03:33 AM, Peter Geoghegan wrote:
> I attach a patch that changes how we maintain the heap invariant
> during tuplesort merging. I already mentioned this over on the
> "Parallel tuplesort, partitioning, merging, and the future" thread.
> As noted already on that thread, this patch makes merging clustered
> numeric input about 2.1x faster overall in one case, which is
> particularly useful in the context of a serial final/leader merge
> during a parallel CREATE INDEX. Even *random* non-C-collated text
> input is made significantly faster. This work is totally orthogonal
> to parallelism, though; it's just very timely, given our discussion
> of the merge bottleneck on this thread.

Nice!

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to the caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were
> already shifting down to compact the heap within the misleadingly
> named [2] function tuplesort_heap_siftup() -- why not instead just
> use the caller tuple (the tuple that we currently go on to insert)
> when initially shifting down (not the heap's preexisting last tuple,
> which is guaranteed to go straight to the leaf level again)? That
> way, we don't need to enlarge the heap at all through insertion,
> shifting up, etc. We're done, and are *guaranteed* to have performed
> less work (fewer comparisons and swaps) than with the existing
> approach (this is the reason for my optimism about getting this
> stuff out of the way early).

Makes sense.

> This new approach is more or less the *conventional* way to maintain
> the heap invariant when returning elements from a heap during k-way
> merging. Our existing approach is convoluted; merging was presumably
> only coded that way because the generic functions
> tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
> available. Perhaps the problem was masked by unrelated bottlenecks
> that existed at the time, too.

Yeah, this seems like a very obvious optimization. Is there a standard name for this technique in the literature? I'm OK with "displace", or perhaps just "replace" or "siftup+insert", but if there's a standard name for this, let's use that.

- Heikki
I'm reviewing patches 1-3 in this series, i.e. those patches that are not directly related to parallelism, but are independent improvements to merging.

Let's begin with patch 1:

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> Cap the number of tapes used by external sorts
>
> Commit df700e6b set merge order based on available buffer space (the
> number of tapes was as high as possible while still allowing at
> least 32 * BLCKSZ buffer space per tape), rejecting Knuth's
> theoretically justified "sweet spot" of 7 tapes (a merge order of 6
> -- Knuth's P), improving performance when the sort thereby completed
> in one pass. However, it's still true that there are unlikely to be
> benefits from increasing the number of tapes past 7 once the amount
> of data to be sorted significantly exceeds available memory; that
> commit probably mostly just improved matters where it enabled all
> merging to be done in a final on-the-fly merge.
>
> One problem with the merge order logic established by that commit is
> that with large work_mem settings and data volumes, the tapes
> previously wasted as much as 8% of the available memory budget; tens
> of thousands of tapes could be logically allocated for a sort that
> will only benefit from a few dozen.

Yeah, wasting 8% of the memory budget on this seems like a bad idea. If I understand correctly, that makes the runs shorter than necessary, leading to more runs.

> A new quasi-arbitrary cap of 501 is applied on the number of tapes
> that tuplesort will ever use (i.e. merge order is capped at 500
> inclusive). This is a conservative estimate of the number of runs at
> which doing all merging on-the-fly no longer allows greater
> overlapping of I/O and computation.

Hmm. Surely there are cases where you could do it in one merge pass with more than 501 tapes, but now you need two passes? And that would hurt performance, no?

Why do we reserve the buffer space for all the tapes right at the beginning? Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) call in inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we start a new run, until we reach maxTapes?

- Heikki
On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I attach a patch that changes how we maintain the heap invariant
>> during tuplesort merging.
>
> Nice!

Thanks!

>> This new approach is more or less the *conventional* way to maintain
>> the heap invariant when returning elements from a heap during k-way
>> merging. Our existing approach is convoluted; merging was presumably
>> only coded that way because the generic functions
>> tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
>> available. Perhaps the problem was masked by unrelated bottlenecks
>> that existed at the time, too.
>
> Yeah, this seems like a very obvious optimization. Is there a
> standard name for this technique in the literature? I'm OK with
> "displace", or perhaps just "replace" or "siftup+insert", but if
> there's a standard name for this, let's use that.

I used the term "displace" specifically because it wasn't a term with a well-defined meaning in the context of the analysis of algorithms. Just like "insert" isn't for tuplesort_heap_insert(). I'm not particularly attached to the name tuplesort_heap_root_displace(), but I do think that whatever it ends up being called should at least not be named after an implementation detail. For example, tuplesort_heap_root_replace() also seems fine.

I think that tuplesort_heap_siftup() should be called something like tuplesort_heap_compact instead [1], since what it actually does (shifting down -- the existing name is completely backwards!) is just an implementation detail involved in compacting the heap (notice that it decrements memtupcount, which, by now, means the k-way merge heap gets one element smaller). I can write a patch to do this renaming, if you're interested. Someone should fix it, because independent of all this, it's just wrong.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com

--
Peter Geoghegan
On Tue, Sep 6, 2016 at 12:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I attach a patch that changes how we maintain the heap invariant
>>> during tuplesort merging.
>>
>> Nice!
>
> Thanks!

BTW, the way that k-way merging is made more efficient by this approach makes the case for replacement selection even weaker than it was just before we almost killed it. I hate to say it, but I have to wonder if we shouldn't get rid of the new-to-9.6 replacement_sort_tuples because of this, and completely kill replacement selection. I'm not going to go on about it, but that seems sensible to me.

--
Peter Geoghegan
On Mon, Aug 15, 2016 at 9:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The patch is intended to be applied on top of parallel B-Tree patches
> 0001-* and 0002-* [1]. I happened to test it with parallelism, but
> these are all independently useful, and will be entered as a separate
> CF entry (perhaps better to commit the earlier two patches first, to
> avoid merge conflicts). I'm optimistic that we can get those 3
> patches in the series out of the way early, without blocking on
> discussing parallel sort.

Applied patches 1 and 2, builds fine, regression tests run fine. It was a prerequisite to reviewing patch 3 (which I'm going to do below), so I thought I might as well report on that tidbit of info, fwiw.

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to the caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were
> already shifting down to compact the heap within the misleadingly
> named [2] function tuplesort_heap_siftup() -- why not instead just
> use the caller tuple (the tuple that we currently go on to insert)
> when initially shifting down (not the heap's preexisting last tuple,
> which is guaranteed to go straight to the leaf level again)? That
> way, we don't need to enlarge the heap at all through insertion,
> shifting up, etc. We're done, and are *guaranteed* to have performed
> less work (fewer comparisons and swaps) than with the existing
> approach (this is the reason for my optimism about getting this
> stuff out of the way early).

Patch 3 applies fine to git master as of 25794e841e5b86a0f90fac7f7f851e5d950e51e2 (on top of patches 1 and 2). Builds fine and without warnings on gcc 4.8.5 AFAICT, and the regression test suite runs without issues as well.

Patch lacks any new tests, but the changed code paths seem covered sufficiently by existing tests. A little bit of fuzzing on the patch itself, like reverting some key changes, or flipping some key comparisons, induces test failures as it should, mostly in cluster.

The logic in tuplesort_heap_root_displace seems sound, except:

+        */
+       memtuples[i] = memtuples[imin];
+       i = imin;
+   }
+
+   Assert(state->memtupcount > 1 || imin == 0);
+   memtuples[imin] = *newtup;
+}

Why that assert? Wouldn't it make more sense to Assert(imin < n)?

In the meantime, I'll go and do some perf testing. Assuming the speedup is realized during testing, LGTM.
On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I'm reviewing patches 1-3 in this series, i.e. those patches that are not > directly related to parallelism, but are independent improvements to > merging.

That's fantastic! Thanks! I'm really glad you're picking those ones up. I feel that I'm far too dependent on Robert's review for this stuff. That shouldn't be taken as a statement against Robert -- it's intended as quite the opposite -- but it's just personally difficult to rely on exactly one other person for something that I've put so much work into. Robert has been involved with 100% of all sorting patches I've written, generally with far less input from anyone else, and at this point, that's really rather a lot of complex patches.

> Let's begin with patch 1: > > On 08/02/2016 01:18 AM, Peter Geoghegan wrote: >> >> Cap the number of tapes used by external sorts > Yeah, wasting 8% of the memory budget on this seems like a bad idea. If I > understand correctly, that makes the runs shorter than necessary, leading to > more runs.

Right. Quite simply, whatever you could have used the workMem for prior to the merge step, now you can't. It's not so bad during the merge step of a final on-the-fly merge (or, with the 0002-* patch, any final merge), since you can get a "refund" of unused (though logically allocated by USEMEM()) tapes to grow memtuples with (other overhead forms the majority of the refund, though). That still isn't much consolation to the user, because run generation is typically much more expensive (we really just refund unused tapes because it's easy).

>> A new quasi-arbitrary cap of 501 is applied on the number of tapes that >> tuplesort will ever use (i.e. merge order is capped at 500 inclusive). >> This is a conservative estimate of the number of runs at which doing all >> merging on-the-fly no longer allows greater overlapping of I/O and >> computation. > > > Hmm. Surely there are cases, so that with > 501 tapes you could do it with > one merge pass, but now you need two? And that would hurt performance, no?

In theory, yes, that could be true, and not just for my proposed new cap of 500 for merge order (501 tapes), but for any such cap. I noticed that the Greenplum tuplesort.c uses a max of 250, so I guess I just thought to double that. Way back in 2006, Tom and Simon talked about a cap too on several occasions, but I think that that was in the thousands then. Hundreds of runs are typically quite rare. It isn't that painful to do a second pass, because the resulting merge process may be more CPU cache efficient, and CPU cost tends to be the dominant cost these days (over and above the extra I/O that an extra pass requires).

This seems like a very familiar situation to me: I pick a quasi-arbitrary limit or cap for something, and it's not clear that it's optimal. Everyone more or less recognizes the need for such a cap, but is uncomfortable about the exact figure chosen, not because it's objectively bad, but because it's clearly something pulled from the air, to some degree. It may not make you feel much better about it, but I should point out that I've read a paper that claims "Modern servers of the day have hundreds of GB operating memory and tens of TB storage capacity. Hence, if the sorted data fit the persistent storage, the first phase will generate hundreds of runs at most." [1].

Feel free to make a counter-proposal for a cap. I'm not attached to 500. I'm mostly worried about blatant waste with very large workMem sizings.
Tens of thousands of tapes is just crazy. The amount of data that you need to have as input is very large when workMem is big enough for this new cap to be enforced.

> Why do we reserve the buffer space for all the tapes right at the beginning? > Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) call in > inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we > start a new run, until we reach maxTapes?

No, because then you have no way to clamp back memory, which is now almost all used (we hold off from making LACKMEM() continually true, if at all possible, which is almost always the case). You can't really continually shrink memtuples to make space for new tapes, which is what it would take.

[1] http://ceur-ws.org/Vol-1343/paper8.pdf -- Peter Geoghegan
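To put rough numbers on the waste being discussed, here is a back-of-the-envelope calculation. It assumes the 9.6-era constants TAPE_BUFFER_OVERHEAD = 3 * BLCKSZ (24KB) and MERGE_BUFFER_SIZE = 32 * BLCKSZ (256KB), so treat the arithmetic as illustrative rather than exact accounting:

/*
 * Rough accounting behind the "8% of the memory budget" figure, for
 * maintenance_work_mem = 5GB (assumed 9.6-era constants):
 *
 *   maxTapes ~ workMem / (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD)
 *            ~ 5GB / 280KB
 *            ~ 18,700 tapes
 *
 *   up-front USEMEM() for tape buffers:
 *     ~18,700 * 24KB ~ 440MB, i.e. roughly 8-9% of the 5GB budget,
 *
 * even when only a handful of runs are ever written.  Under the
 * proposed cap, 501 tapes reserve only ~12MB up front.
 */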
On Tue, Sep 6, 2016 at 2:46 PM, Peter Geoghegan <pg@heroku.com> wrote: > Feel free to make a counter-proposal for a cap. I'm not attached to > 500. I'm mostly worried about blatant waste with very large workMem > sizings. Tens of thousands of tapes is just crazy. The amount of data > that you need to have as input is very large when workMem is big > enough for this new cap to be enforced. If tuplesort callers passed a hint about the number of tuples that would ultimately be sorted, and (for the sake of argument) it was magically 100% accurate, then theoretically we could just allocate the right number of tapes up-front. That discussion is a big can of worms, though. There are of course obvious disadvantages that come with a localized cost model, even if you're prepared to add some "slop" to the allocation size or whatever. -- Peter Geoghegan
On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Patch lacks any new tests, but the changed code paths seem covered > sufficiently by existing tests. A little bit of fuzzing on the patch > itself, like reverting some key changes, or flipping some key > comparisons, induces test failures as it should, mostly in cluster. > > The logic in tuplesort_heap_root_displace seems sound, except: > > + */ > + memtuples[i] = memtuples[imin]; > + i = imin; > + } > + > + Assert(state->memtupcount > 1 || imin == 0); > + memtuples[imin] = *newtup; > +} > > Why that assert? Wouldn't it make more sense to Assert(imin < n) ? There might only be one or two elements in the heap. Note that the heap size is indicated by state->memtupcount at this point in the sort, which is a little confusing (that differs from how memtupcount is used elsewhere, where we don't partition memtuples into a heap portion and a preread tuples portion, as we do here). > In the meanwhile, I'll go and do some perf testing. > > Assuming the speedup is realized during testing, LGTM. Thanks. I suggest spending at least as much time on unsympathetic cases (e.g., only 2 or 3 tapes must be merged). At the same time, I suggest focusing on a type that has relatively expensive comparisons, such as collated text, to make differences clearer. -- Peter Geoghegan
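In other words, something like the following layout holds during merging (a simplified sketch of the 9.6-era arrangement, not authoritative; the exact bookkeeping lives in tuplesort.c):

/*
 * Rough layout of memtuples[] during a merge.  Unlike in other phases,
 * state->memtupcount counts only the heap portion here:
 *
 *   memtuples[0]                    root of the merge heap: the next
 *                                   tuple to return to caller
 *   memtuples[1 .. memtupcount-1]   rest of the heap, roughly one
 *                                   entry per input tape still active
 *   memtuples[memtupcount .. N-1]   slots for preread tuples batched
 *                                   in from the tapes
 */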
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> Patch lacks any new tests, but the changed code paths seem covered >> sufficiently by existing tests. A little bit of fuzzing on the patch >> itself, like reverting some key changes, or flipping some key >> comparisons, induces test failures as it should, mostly in cluster. >> >> The logic in tuplesort_heap_root_displace seems sound, except: >> >> + */ >> + memtuples[i] = memtuples[imin]; >> + i = imin; >> + } >> + >> + Assert(state->memtupcount > 1 || imin == 0); >> + memtuples[imin] = *newtup; >> +} >> >> Why that assert? Wouldn't it make more sense to Assert(imin < n) ? > > There might only be one or two elements in the heap. Note that the > heap size is indicated by state->memtupcount at this point in the > sort, which is a little confusing (that differs from how memtupcount > is used elsewhere, where we don't partition memtuples into a heap > portion and a preread tuples portion, as we do here).

I noticed, but here n = state->memtupcount:

+ Assert(memtuples[0].tupindex == newtup->tupindex);
+
+ CHECK_FOR_INTERRUPTS();
+
+ n = state->memtupcount; /* n is heap's size, including old root */
+ imin = 0; /* start with caller's "hole" in root */
+ i = imin;

In fact, the assert on the patch would allow writing memtuples outside the heap, as in calling tuplesort_heap_root_displace if memtupcount==0, but I don't think that should be legal (memtuples[0] == memtuples[imin] would be outside the heap).

Sure, that's a weird enough case (that assert up there already reads memtuples[0] which would be equally illegal if memtupcount==0), but it goes on to show that the assert expression just seems odd for its intent.

BTW, I know it's not the scope of the patch, but shouldn't root_displace be usable on the TSS_BOUNDED phase?
On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > I noticed, but here n = state->memtupcount > > + Assert(memtuples[0].tupindex == newtup->tupindex); > + > + CHECK_FOR_INTERRUPTS(); > + > + n = state->memtupcount; /* n is heap's size, > including old root */ > + imin = 0; /* > start with caller's "hole" in root */ > + i = imin;

I'm fine with using "n" in the later assertion you mentioned, if that's clearer to you. memtupcount is broken out as "n" simply because that's less verbose, in a place where that makes things far clearer.

> In fact, the assert on the patch would allow writing memtuples outside > the heap, as in calling tuplesort_heap_root_displace if > memtupcount==0, but I don't think that should be legal (memtuples[0] > == memtuples[imin] would be outside the heap).

You have to have a valid heap (i.e. there must be at least one element) to call tuplesort_heap_root_displace(), and it doesn't directly compact the heap, so it must remain valid on return. The assertion exists to make sure that everything is okay with a one-element heap, a case which is quite possible. If you want to see a merge involving one input tape, apply the entire parallel CREATE INDEX patch set, set "force_parallel_mode = regress", and note that the leader merges only 1 input tape, making the heap only ever contain one element. In general, most use of the heap for k-way merging will eventually end up as a one element heap, at the very end. Maybe that assertion you mention is overkill, but I like to err on the side of overkill with assertions. It doesn't seem that important, though.

> Sure, that's a weird enough case (that assert up there already reads > memtuples[0] which would be equally illegal if memtupcount==0), but it > goes on to show that the assert expression just seems odd for its > intent. > > BTW, I know it's not the scope of the patch, but shouldn't > root_displace be usable on the TSS_BOUNDED phase?

I don't think it should be, no. With a top-n heap sort, the expectation is that after a little while, we can immediately determine that most tuples do not belong in the heap (more than one comparison per tuple is only required when the tuple actually does go into the heap, which should be fairly rare after a time). That's why that general strategy can be so much faster, of course. Note that that heap is "reversed" -- the sort order is inverted, so that we can use a minheap. The top of the heap is the most marginal tuple in the top-n heap so far, and so is the next to be removed from consideration entirely (not the next to be returned to caller, when merging). Anyway, I just don't think that this is important enough to change -- it couldn't possibly be worth much of any risk. I can see the appeal of consistency, but I also see the appeal of sticking to how things work there: continually and explicitly inserting into and compacting the heap seems like a good enough way of framing what a top-n heap does, since there are no groupings of tuples (tapes) involved there. -- Peter Geoghegan
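For reference, the bounded-sort path at issue looks roughly like this. This is reconstructed from memory of 9.6's tuplesort.c and lightly simplified, so check it against the tree rather than taking it as verbatim:

/*
 * TSS_BOUNDED (top-n) case, sketched.  The bounded heap uses a
 * reversed comparator, so memtuples[0] is the most marginal tuple
 * retained so far; most input tuples lose to it after a single
 * comparison and are discarded immediately.
 */
if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)
{
    /* not better than the current most-marginal tuple: discard */
    free_sort_tuple(state, tuple);
}
else
{
    /* discard the old root, then compact and insert -- the
     * siftup+insert sequence Claudio suggests replacing with
     * root_displace */
    free_sort_tuple(state, &state->memtuples[0]);
    tuplesort_heap_siftup(state, false);
    tuplesort_heap_insert(state, tuple, 0, false);
}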
On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> I noticed, but here n = state->memtupcount >> >> + Assert(memtuples[0].tupindex == newtup->tupindex); >> + >> + CHECK_FOR_INTERRUPTS(); >> + >> + n = state->memtupcount; /* n is heap's size, >> including old root */ >> + imin = 0; /* >> start with caller's "hole" in root */ >> + i = imin; > > I'm fine with using "n" in the later assertion you mentioned, if > that's clearer to you. memtupcount is broken out as "n" simply because > that's less verbose, in a place where that makes things far clearer. > >> In fact, the assert on the patch would allow writing memtuples outside >> the heap, as in calling tuplesort_heap_root_displace if >> memtupcount==0, but I don't think that should be legal (memtuples[0] >> == memtuples[imin] would be outside the heap). > > You have to have a valid heap (i.e. there must be at least one > element) to call tuplesort_heap_root_displace(), and it doesn't > directly compact the heap, so it must remain valid on return. The > assertion exists to make sure that everything is okay with a > one-element heap, a case which is quite possible.

More than using "n" or "memtupcount" what I'm saying is to assert that memtuples[imin] is inside the heap, which would catch the same errors the original assert would, and more.

Assert(imin < state->memtupcount)

If you prefer. The original assert allows any value of imin for memtupcount>1, and that's my main concern. It shouldn't.

On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Sure, that's a weird enough case (that assert up there already reads >> memtuples[0] which would be equally illegal if memtupcount==0), but it >> goes on to show that the assert expression just seems odd for its >> intent. >> >> BTW, I know it's not the scope of the patch, but shouldn't >> root_displace be usable on the TSS_BOUNDED phase? > > I don't think it should be, no. With a top-n heap sort, the > expectation is that after a little while, we can immediately determine > that most tuples do not belong in the heap (more than one comparison > per tuple is only required when the tuple actually does go into the > heap, which should be fairly rare after a time). That's why that > general strategy can be so much faster, of course.

I wasn't proposing getting rid of that optimization, but just replacing the siftup+insert step with root_displace...

> Note that that heap is "reversed" -- the sort order is inverted, so > that we can use a minheap. The top of the heap is the most marginal > tuple in the top-n heap so far, and so is the next to be removed from > consideration entirely (not the next to be returned to caller, when > merging).

...but I didn't pause to consider that point. It still looks like a valid optimization: instead of rearranging the heap twice (siftup + insert), do it once (replace + relocate). However, I agree that it's not worth the risk of conflating the two optimizations. That one can be done later as a separate patch.
On Tue, Sep 6, 2016 at 5:50 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > However, I agree that it's not worth the risk conflating the two > optimizations. That one can be done later as a separate patch. I'm rather fond of the assertions about tape number that exist within root_displace currently. But, yeah, maybe. -- Peter Geoghegan
On 09/06/2016 10:42 PM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 12:39 PM, Peter Geoghegan <pg@heroku.com> wrote: >> On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>>> I attach a patch that changes how we maintain the heap invariant >>>> during tuplesort merging. >> >>> Nice! >> >> Thanks! > > BTW, the way that k-way merging is made more efficient by this > approach makes the case for replacement selection even weaker than it > was just before we almost killed it. This also makes the replacement selection cheaper, no? > I hate to say it, but I have to > wonder if we shouldn't get rid of the new-to-9.6 > replacement_sort_tuples because of this, and completely kill > replacement selection. I'm not going to go on about it, but that seems > sensible to me. Yeah, perhaps. But that's a different story. - Heikki
On Tue, Sep 6, 2016 at 10:28 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> BTW, the way that k-way merging is made more efficient by this >> approach makes the case for replacement selection even weaker than it >> was just before we almost killed it. > > > This also makes the replacement selection cheaper, no?

Well, maybe, but the whole idea behind replacement_sort_tuples (by which I mean the continued occasional use of replacement selection by Postgres) was that we hope to avoid a merge step *entirely*. This new merge shift down heap patch could make the merge step so cheap as to be next to free anyway (in the event of presorted input), so the argument for replacement_sort_tuples is weakened further. It might always be cheaper once you factor in that the TSS_SORTEDONTAPE path for returning tuples to caller happens to not be able to use batch memory, even with something like collated text. And, as a bonus, you get something that works just as well with an inverse correlation, which was traditionally the worst case for replacement selection (it makes it produce runs no larger than those produced by quicksort). Anyway, I only mention this because it occurs to me. I have no desire to go back to talking about replacement selection either. Maybe it's useful to point this out, because it makes it clearer still that severely limiting the use of replacement selection in 9.6 was totally justified. -- Peter Geoghegan
On Tue, Sep 6, 2016 at 10:36 PM, Peter Geoghegan <pg@heroku.com> wrote: > Well, maybe, but the whole idea behind replacement_sort_tuples (by > which I mean the continued occasional use of replacement selection by > Postgres) was that we hope to avoid a merge step *entirely*. This new > merge shift down heap patch could make the merge step so cheap as to > be next to free anyway (in the event of presorted input)

I mean: Cheaper than just processing the tuples to return to caller without comparisons/merging (within the TSS_SORTEDONTAPE path). I do not mean free in an absolute sense, of course. -- Peter Geoghegan
On 09/07/2016 12:46 AM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Why do we reserve the buffer space for all the tapes right at the beginning? >> Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) call in >> inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we >> start a new run, until we reach maxTapes? > > No, because then you have no way to clamp back memory, which is now > almost all used (we hold off from making LACKMEM() continually true, > if at all possible, which is almost always the case). You can't really > continually shrink memtuples to make space for new tapes, which is > what it would take.

I still don't get it. When building the initial runs, we don't need buffer space for maxTapes yet, because we're only writing to a single tape at a time. An unused tape shouldn't take much memory. In inittapes(), when we have built all the runs, we know how many tapes we actually needed, and we can allocate the buffer memory accordingly.

[thinks a bit, looks at logtape.c]. Hmm, I guess that's wrong, because of the way this all is implemented. When we're building the initial runs, we're only writing to one tape at a time, but logtape.c nevertheless holds onto a BLCKSZ'd currentBuffer, plus one buffer for each indirect level, for every tape that has been used so far. What if we changed LogicalTapeRewind to free those buffers? Flush out the indirect buffers to disk, remembering just the physical block number of the topmost indirect block in memory, and free currentBuffer. That way, a tape that has been used, but isn't being read or written to at the moment, would take very little memory, and we wouldn't need to reserve space for them in the build-runs phase. - Heikki
On Tue, Sep 6, 2016 at 10:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I still don't get it. When building the initial runs, we don't need buffer > space for maxTapes yet, because we're only writing to a single tape at a > time. An unused tape shouldn't take much memory. In inittapes(), when we > have built all the runs, we know how many tapes we actually needed, and we > can allocate the buffer memory accordingly.

Right. That's correct. But, we're not concerned about physically allocated memory, but rather logically allocated memory (i.e., what goes into USEMEM()). tuplesort.c should be able to fully use the workMem specified by caller in the event of an external sort, just as with an internal sort.

> [thinks a bit, looks at logtape.c]. Hmm, I guess that's wrong, because of > the way this all is implemented. When we're building the initial runs, we're > only writing to one tape at a time, but logtape.c nevertheless holds onto a > BLCKSZ'd currentBuffer, plus one buffer for each indirect level, for every > tape that has been used so far. What if we changed LogicalTapeRewind to free > those buffers?

There isn't much point in that, because those buffers are never physically allocated in the first place when there are thousands. They are, however, entered into the tuplesort.c accounting as if they were, denying tuplesort.c the full benefit of available workMem. It doesn't matter if you USEMEM() or FREEMEM() after we first spill to disk, but before we begin the merge. (We already refund the unused-but-logically-allocated memory from unused tapes at the beginning of the merge (within beginmerge()), so we can't do any better than we already are from that point on -- that makes the batch memtuples growth thing slightly more effective.) -- Peter Geoghegan
On Tue, Sep 6, 2016 at 10:57 PM, Peter Geoghegan <pg@heroku.com> wrote: > There isn't much point in that, because those buffers are never > physically allocated in the first place when there are thousands. They > are, however, entered into the tuplesort.c accounting as if they were, > denying tuplesort.c the full benefit of available workMem. It doesn't > matter if you USEMEM() or FREEMEM() after we first spill to disk, but > before we begin the merge. (We already refund the > unused-but-logically-allocated memory from unused tapes at the beginning of > the merge (within beginmerge()), so we can't do any better than we > already are from that point on -- that makes the batch memtuples > growth thing slightly more effective.)

The big picture here is that you can't only USEMEM() for tapes as the need arises for new tapes as new runs are created. You'll just run a massive availMem deficit, that you have no way of paying back, because you can't "liquidate assets to pay off your creditors" (e.g., release a bit of the memtuples memory). The fact is that memtuples growth doesn't work that way. The memtuples array never shrinks. -- Peter Geoghegan
On 09/07/2016 09:01 AM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 10:57 PM, Peter Geoghegan <pg@heroku.com> wrote: >> There isn't much point in that, because those buffers are never >> physically allocated in the first place when there are thousands. They >> are, however, entered into the tuplesort.c accounting as if they were, >> denying tuplesort.c the full benefit of available workMem. It doesn't >> matter if you USEMEM() or FREEMEM() after we first spill to disk, but >> before we begin the merge. (We already refund the >> unused-but-logically-allocated memory from unused tapes at the beginning of >> the merge (within beginmerge()), so we can't do any better than we >> already are from that point on -- that makes the batch memtuples >> growth thing slightly more effective.) > > The big picture here is that you can't only USEMEM() for tapes as the > need arises for new tapes as new runs are created. You'll just run a > massive availMem deficit, that you have no way of paying back, because > you can't "liquidate assets to pay off your creditors" (e.g., release > a bit of the memtuples memory). The fact is that memtuples growth > doesn't work that way. The memtuples array never shrinks.

Hmm. But memtuples is empty, just after we have built the initial runs. Why couldn't we shrink, i.e. free and reallocate, it? - Heikki
On Tue, Sep 6, 2016 at 11:09 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> The big picture here is that you can't only USEMEM() for tapes as the >> need arises for new tapes as new runs are created. You'll just run a >> massive availMem deficit, that you have no way of paying back, because >> you can't "liquidate assets to pay off your creditors" (e.g., release >> a bit of the memtuples memory). The fact is that memtuples growth >> doesn't work that way. The memtuples array never shrinks. > > > Hmm. But memtuples is empty, just after we have built the initial runs. Why > couldn't we shrink, i.e. free and reallocate, it? After we've built the initial runs, we do in fact give a FREEMEM() refund to those tapes that were not used within beginmerge(), as I mentioned just now (with a high workMem, this is often the great majority of many thousands of logical tapes -- that's how you get to wasting 8% of 5GB of maintenance_work_mem). What's at issue with this 500 tapes cap patch is what happens after tuples are first dumped (after we decide that this is going to be an external sort -- where we call tuplesort_merge_order() to get the number of logical tapes in the tapeset), but before the final merge happens, where we're already doing the right thing for merging by giving that refund. I want to stop logical allocation (USEMEM()) of an enormous number of tapes, to make run generation itself able to use more memory. It's surprisingly difficult to do something cleverer than just impose a cap. -- Peter Geoghegan
On 09/07/2016 09:17 AM, Peter Geoghegan wrote: > On Tue, Sep 6, 2016 at 11:09 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> The big picture here is that you can't only USEMEM() for tapes as the >>> need arises for new tapes as new runs are created. You'll just run a >>> massive availMem deficit, that you have no way of paying back, because >>> you can't "liquidate assets to pay off your creditors" (e.g., release >>> a bit of the memtuples memory). The fact is that memtuples growth >>> doesn't work that way. The memtuples array never shrinks. >> >> >> Hmm. But memtuples is empty, just after we have built the initial runs. Why >> couldn't we shrink, i.e. free and reallocate, it? > > After we've built the initial runs, we do in fact give a FREEMEM() > refund to those tapes that were not used within beginmerge(), as I > mentioned just now (with a high workMem, this is often the great > majority of many thousands of logical tapes -- that's how you get to > wasting 8% of 5GB of maintenance_work_mem).

Peter and I chatted over IM about this. Let me try to summarize the problems, and my plan:

1. When we start to build the initial runs, we currently reserve memory for tape buffers, maxTapes * TAPE_BUFFER_OVERHEAD. But we only actually need the buffers for tapes that are really used. We "refund" the buffers for the unused tapes after we've built the initial runs, but we're still wasting that while building the initial runs. We didn't actually allocate it, but we could've used it for other things. Peter's solution to this was to put a cap on maxTapes.

2. My observation is that during the build-runs phase, you only actually need those tape buffers for the one tape you're currently writing to. When you switch to a different tape, you could flush and free the buffers for the old tape. So reserving maxTapes * TAPE_BUFFER_OVERHEAD is excessive, 1 * TAPE_BUFFER_OVERHEAD would be enough. logtape.c doesn't have an interface for doing that today, but it wouldn't be hard to add.

3. If we do that, we'll still have to reserve the tape buffers for all the tapes that we use during merge. So after we've built the initial runs, we'll need to reserve memory for those buffers. That might require shrinking memtuples. But that's OK: after building the initial runs, memtuples is empty, so we can shrink it.

- Heikki
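Point 2 implies a small new logtape.c entry point, along these lines. This is purely a hypothetical sketch to make the plan concrete: no such function exists in the tree at this point, and the name and signature are invented.

/*
 * Hypothetical addition to logtape.c for point 2 above: when run
 * building switches away from a tape, flush its dirty buffers to disk,
 * remember just the block number of the topmost indirect block, and
 * free the in-memory buffers.  An inactive tape would then cost
 * (almost) nothing while some other tape is being written.
 */
extern void LogicalTapePause(LogicalTapeSet *lts, int tapenum);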
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote: >> In the meanwhile, I'll go and do some perf testing. >> >> Assuming the speedup is realized during testing, LGTM. > > Thanks. I suggest spending at least as much time on unsympathetic > cases (e.g., only 2 or 3 tapes must be merged). At the same time, I > suggest focusing on a type that has relatively expensive comparisons, > such as collated text, to make differences clearer.

The tests are still running (the benchmark script I came up with runs for a lot longer than I anticipated, about 2 days), but preliminary results are very promising; I can see a clear and consistent speedup. We'll have to wait for the complete results to see if there's any significant regression, though. I'll post the full results when I have them, but so far it all looks like this:

setup:

create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
insert into lotsofitext select cast(random() * 1000000000.0 as text) || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) || 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) || 'blablabl awwwabla', random() * 1000000000.0, random() * 1000000000000.0 from generate_series(1, 10000000);

timed:

select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;

Unpatched Time: 100351.251 ms
Patched Time: 75180.787 ms

That's like a 25% speedup on random input. As we say over here, rather badly translated, not a turkey's boogers (meaning "nice!")

On Tue, Sep 6, 2016 at 9:50 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <pg@heroku.com> wrote: >> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> I noticed, but here n = state->memtupcount >>> >>> + Assert(memtuples[0].tupindex == newtup->tupindex); >>> + >>> + CHECK_FOR_INTERRUPTS(); >>> + >>> + n = state->memtupcount; /* n is heap's size, >>> including old root */ >>> + imin = 0; /* >>> start with caller's "hole" in root */ >>> + i = imin; >> >> I'm fine with using "n" in the later assertion you mentioned, if >> that's clearer to you. memtupcount is broken out as "n" simply because >> that's less verbose, in a place where that makes things far clearer. >> >>> In fact, the assert on the patch would allow writing memtuples outside >>> the heap, as in calling tuplesort_heap_root_displace if >>> memtupcount==0, but I don't think that should be legal (memtuples[0] >>> == memtuples[imin] would be outside the heap). >> >> You have to have a valid heap (i.e. there must be at least one >> element) to call tuplesort_heap_root_displace(), and it doesn't >> directly compact the heap, so it must remain valid on return. The >> assertion exists to make sure that everything is okay with a >> one-element heap, a case which is quite possible. > > More than using "n" or "memtupcount" what I'm saying is to assert that > memtuples[imin] is inside the heap, which would catch the same errors > the original assert would, and more. > > Assert(imin < state->memtupcount) > > If you prefer. > > The original assert allows any value of imin for memtupcount>1, and > that's my main concern. It shouldn't.
So, for the assertions to properly avoid clobbering/reading out of bounds memory, you need both the above assert:

+ */
+ memtuples[i] = memtuples[imin];
+ i = imin;
+ }
+
> + Assert(imin < state->memtupcount);
+ memtuples[imin] = *newtup;
+}

And another one at the beginning, asserting:

+ SortTuple *memtuples = state->memtuples;
+ int n,
+ imin,
+ i;
+
> + Assert(state->memtupcount > 0 && memtuples[0].tupindex == newtup->tupindex);
+
+ CHECK_FOR_INTERRUPTS();

It's worth making that change, IMHO, unless I'm missing something.
On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > setup: > > create table lotsofitext(i text, j text, w text, z integer, z2 bigint); > insert into lotsofitext select cast(random() * 1000000000.0 as text) > || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) || > 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) || > 'blablabl > awwwabla', random() * 1000000000.0, random() * 1000000000000.0 from > generate_series(1, 10000000); > > timed: > > select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t; > > Unpatched Time: 100351.251 ms > Patched Time: 75180.787 ms > > That's like a 25% speedup on random input. As we say over here, rather > badly translated, not a turkey's boogers (meaning "nice!") Cool! What work_mem setting were you using here? >> More than using "n" or "memtupcount" what I'm saying is to assert that >> memtuples[imin] is inside the heap, which would catch the same errors >> the original assert would, and more. >> >> Assert(imin < state->memtupcount) >> >> If you prefer. >> >> The original asserts allows any value of imin for memtupcount>1, and >> that's my main concern. It shouldn't. > > So, for the assertions to properly avoid clobbering/reading out of > bounds memory, you need both the above assert: > > + */ > + memtuples[i] = memtuples[imin]; > + i = imin; > + } > + >>+ Assert(imin < state->memtupcount); > + memtuples[imin] = *newtup; > +} > > And another one at the beginning, asserting: > > + SortTuple *memtuples = state->memtuples; > + int n, > + imin, > + i; > + >>+ Assert(state->memtupcount > 0 && memtuples[0].tupindex == newtup->tupindex); > + > + CHECK_FOR_INTERRUPTS(); > > It's worth making that change, IMHO, unless I'm missing something. You're supposed to just not call it with an empty heap, so the assertions trust that much. I'll look into that. Currently, producing a new revision of this entire patchset. Improving the cost model (used when the parallel_workers storage parameter is not specified within CREATE INDEX) is taking a bit of time, but hope to have it out in the next couple of days. -- Peter Geoghegan
On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> setup: >> >> create table lotsofitext(i text, j text, w text, z integer, z2 bigint); >> insert into lotsofitext select cast(random() * 1000000000.0 as text) >> || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) || >> 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) || >> 'blablabl >> awwwabla', random() * 1000000000.0, random() * 1000000000000.0 from >> generate_series(1, 10000000); >> >> timed: >> >> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t; >> >> Unpatched Time: 100351.251 ms >> Patched Time: 75180.787 ms >> >> That's like a 25% speedup on random input. As we say over here, rather >> badly translated, not a turkey's boogers (meaning "nice!") > > Cool! What work_mem setting were you using here? The script iterates over a few variations of string patterns (easy comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB, 4GB), and table sizes (~350M, ~650M, ~1.5G). That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.
On Thu, Sep 8, 2016 at 10:18 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.

Cool. I wonder where this leaves Heikki's draft patch, which completely removes batch memory, etc. -- Peter Geoghegan
On Wed, Sep 7, 2016 at 2:36 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > 3. If we do that, we'll still have to reserve the tape buffers for all the > tapes that we use during merge. So after we've built the initial runs, we'll > need to reserve memory for those buffers. That might require shrinking > memtuples. But that's OK: after building the initial runs, memtuples is > empty, so we can shrink it.

Do you really think all this is worth the effort? Given how things are going to improve for merging anyway, I tend to doubt it. I'd rather just apply the cap (not necessarily 501 tapes, but something), and be done with it. As you know, Knuth never advocated more than 7 tapes at once, which I don't think had anything to do with the economics of tape drives in the 1970s (or problems with tape operators getting repetitive strain injuries). There is a chart in volume 3 about this. Senior hackers talked about a cap like this from day one, back in 2006, when Simon and Tom initially worked on scaling the number of tapes.

Alternatively, we could make MERGE_BUFFER_SIZE much larger, which I think would be a good idea independent of whatever waste the logical allocation of never-used tapes presents us with. It's currently 1/4 of 1MiB, which is hardly anything these days, and doesn't seem to have much to do with OS read ahead trigger sizes. If we were going to do something like you describe here, I'd prefer it to be driven by an observable benefit in performance, rather than a theoretical benefit. Not doing everything in one pass isn't necessarily worse than having a less cache efficient heap -- it might be quite a bit better, in fact. You've seen how hard it can be to get a sort that is I/O bound. (Sorting will tend to not be completely I/O bound, unless perhaps parallelism is used).

Anyway, this patch (patch 0001-*) is by far the least important of the 3 that you and Claudio are signed up to review. I don't think it's worth bending over backwards to do better. If you're not comfortable with a simple cap like this, then I'd suggest that we leave it at that, since our time is better spent elsewhere. We can just shelve it for now -- "returned with feedback". I wouldn't make any noise about it (although, I actually don't think that the cap idea is at all controversial). -- Peter Geoghegan
On Thu, Sep 8, 2016 at 2:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote: >> On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> setup: >>> >>> create table lotsofitext(i text, j text, w text, z integer, z2 bigint); >>> insert into lotsofitext select cast(random() * 1000000000.0 as text) >>> || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) || >>> 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) || >>> 'blablabl >>> awwwabla', random() * 1000000000.0, random() * 1000000000000.0 from >>> generate_series(1, 10000000); >>> >>> timed: >>> >>> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t; >>> >>> Unpatched Time: 100351.251 ms >>> Patched Time: 75180.787 ms >>> >>> That's like a 25% speedup on random input. As we say over here, rather >>> badly translated, not a turkey's boogers (meaning "nice!") >> >> Cool! What work_mem setting were you using here? > > The script iterates over a few variations of string patterns (easy > comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB, > 4GB), and table sizes (~350M, ~650M, ~1.5G). > > That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.

Well, the worst regression I see is under the noise for this test (which seems rather high at 5%, but that's to be expected since it's mostly big queries). Most samples show an improvement, either marginal or significant. The most improvement is, naturally, on low work_mem settings. I don't see significant slowdown on work_mem settings that should result in just a few tapes being merged, but I didn't instrument to check how many tapes were being merged in any case. Attached are the results both in ods, csv and raw formats. I think these are good results.

So, to summarize the review:

- Patch seems to follow the coding conventions of surrounding code
- Applies cleanly on top of 25794e841e5b86a0f90fac7f7f851e5d950e51e2, plus patches 1 and 2
- Builds without warnings
- Passes regression tests
- IMO has sufficient coverage from existing tests (none added)
- Does not introduce any significant performance regression
- Best improvement of 67% (reduction of runtime to 59%)
- Average improvement of 30% (reduction of runtime to 77%)
- Worst regression of 5% (increase of runtime to 105%), which is under the noise for control queries, so not significant
- Performance improvement is highly desirable in this merge step, as it's a big bottleneck in parallel sort (and, it seems, regular sort as well)
- All testing was done on random input; presorted input *will* show more pronounced improvements

I suggested changing a few asserts in tuplesort_heap_root_displace to make the debug code stricter in checking the assumptions, but they're not blockers:

+ Assert(state->memtupcount > 1 || imin == 0);
+ memtuples[imin] = *newtup;

into

+ Assert(imin < state->memtupcount);
+ memtuples[imin] = *newtup;

And, perhaps as well,

+ Assert(memtuples[0].tupindex == newtup->tupindex);
+
+ CHECK_FOR_INTERRUPTS();

into

+ Assert(state->memtupcount > 0 && memtuples[0].tupindex == newtup->tupindex);
+
+ CHECK_FOR_INTERRUPTS();

It was suggested that both tuplesort_heap_siftup and tuplesort_heap_root_displace could be wrappers around a common "siftup" implementation, since the underlying operation is very similar.
Since it is true that doing so would make it impossible to keep the asserts about tupindex in tuplesort_heap_root_displace, I guess it depends on how useful those asserts are (ie: how likely it is that those conditions could be violated, and how damaging it could be if they were). If it is decided the refactor is desirable, I'd suggest making the common siftup procedure static inline, to allow tuplesort_heap_root_displace to inline and specialize it, since it will be called with checkIndex=False and that simplifies the resulting code considerably.

Peter also mentioned that there were some other changes going on in the surrounding code that could impact this patch, so I'm marking the patch Waiting on Author.

Overall, however, I believe the patch is in good shape. Only minor form issues need to be changed; the functionality seems both desirable and ready.
On Fri, Sep 9, 2016 at 9:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Since it is true that doing so would make it impossible to keep the > asserts about tupindex in tuplesort_heap_root_displace, I guess it > depends on how useful those asserts are (ie: how likely it is that > those conditions could be violated, and how damaging it could be if > they were). If it is decided the refactor is desirable, I'd suggest > making the common siftup procedure static inline, to allow > tuplesort_heap_root_displace to inline and specialize it, since it > will be called with checkIndex=False and that simplifies the resulting > code considerably. > > Peter also mentioned that there were some other changes going on in > the surrounding code that could impact this patch, so I'm marking the > patch Waiting on Author. > > Overall, however, I believe the patch is in good shape. Only minor > form issues need to be changed; the functionality seems both desirable > and ready.

Sorry, forgot to specify, that was all about patch 3, the one about tuplesort_heap_root_displace.
On Fri, Sep 9, 2016 at 5:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Since it is true that doing so would make it impossible to keep the > asserts about tupindex in tuplesort_heap_root_displace, I guess it > depends on how useful those asserts are (ie: how likely it is that > those conditions could be violated, and how damaging it could be if > they were). If it is decided the refactor is desirable, I'd suggest > making the common siftup procedure static inline, to allow > tuplesort_heap_root_displace to inline and specialize it, since it > will be called with checkIndex=False and that simplifies the resulting > code considerably.

Right. I want to keep it as a separate function for all these reasons. I also think that I'll end up further optimizing what I've called tuplesort_heap_root_displace in the future, to adapt to clustered input. I'm thinking of something like Timsort's "galloping mode". What I've come up with here still needs 2 comparisons and a swap per call for presorted input. There is still a missed opportunity for clustered or (inverse) correlated input -- we can make merging opportunistically skip ahead to determine that the root tape's 100th tuple (say) would still fit in the root position of the merge minheap. So, immediately return 100 tuples from the root's tape without bothering to compare them to anything. Do a binary search to find the best candidate minheap root before the 100th tuple if a guess of 100 doesn't work out. Adapt to trends. Stuff like that. -- Peter Geoghegan
On 09/10/2016 03:22 AM, Claudio Freire wrote: > Overall, however, I believe the patch is in good shape. Only minor > form issues need to be changed; the functionality seems both desirable > and ready.

Pushed this "displace root" patch, with some changes:

* I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I realize that this is controversial, per the discussion on the "Is tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a new function, "tuplesort_heap_replace_top()", which is exactly the same algorithm as the "delete_top()" algorithm, calling one of them "siftup" became just too confusing. If anything, the new "replace_top" corresponds more closely to Knuth's siftup algorithm; delete-top is a special case of it. I added a comment on that to replace_top. I hope everyone can live with this.

* Instead of "root_displace", I used the name "replace_top", and "delete_top" for the old siftup function. Because we use "top" to refer to memtuples[0] more commonly than "root", in the existing comments.

* I shared the code between the delete-top and replace-top. Delete-top now calls the replace-top function, with the last element of the heap. Both functions have the same signature, i.e. they both take the checkIndex argument. Peter's patch left that out for the "replace" function, on performance grounds, but if that's worthwhile, that seems like a separate optimization. Might be worth benchmarking that separately, but I didn't want to conflate that with this patch.

* I replaced a few more siftup+insert calls with the new combined replace-top operation. Because why not.

Thanks for the patch, Peter, and thanks for the review, Claudio!

- Heikki
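The shared-code arrangement Heikki describes presumably reduces to something like the following (a paraphrase of the committed approach, not the verbatim committed code):

/*
 * Sketch: delete-top is just replace-top with the heap's former last
 * element, per the third bullet point above.
 */
static void
tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
{
    SortTuple  *memtuples = state->memtuples;

    if (--state->memtupcount <= 0)
        return;                 /* heap is now empty */

    /* sift the former last element down from the root */
    tuplesort_heap_replace_top(state, &memtuples[state->memtupcount],
                               checkIndex);
}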
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > * I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I realize > that this is controversial, per the discussion on the "Is > tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a new > function, "tuplesort_heap_replace_top()", which is exactly the same > algorithm as the "delete_top()" algorithm, calling one of them "siftup" > became just too confusing. I feel pretty strongly that this was the correct decision. I would have gone further, and removed any mention of "Sift up", but you can't win them all. > * Instead of "root_displace", I used the name "replace_top", and > "delete_top" for the old siftup function. Because we use "top" to refer to > memtuples[0] more commonly than "root", in the existing comments. Fine by me. > * I shared the code between the delete-top and replace-top. Delete-top now > calls the replace-top function, with the last element of the heap. Both > functions have the same signature, i.e. they both take the checkIndex > argument. Peter's patch left that out for the "replace" function, on > performance grounds, but if that's worthwhile, that seems like a separate > optimization. Might be worth benchmarking that separately, but I didn't want > to conflate that with this patch. Okay. > * I replaced a few more siftup+insert calls with the new combined > replace-top operation. Because why not. I suppose that the consistency has value, from a code clarity standpoint. > Thanks for the patch, Peter, and thanks for the review, Claudio! Thanks Heikki! -- Peter Geoghegan
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Pushed this "displace root" patch, with some changes:

Attached is a rebased version of the entire patch series, which should be applied on top of what you pushed to the master branch today. This features a new scheme for managing workMem -- maintenance_work_mem is now treated as a high watermark/budget for the entire CREATE INDEX operation, regardless of the number of workers. This seems to work much better, so Robert was right to suggest it. There were also improvements to the cost model, to weigh available maintenance_work_mem under this new system. And, the cost model was moved inside planner.c (next to plan_cluster_use_sort()), which is really where it belongs. The cost model is still WIP, though, and I didn't address some concerns of my own about how tuplesort.c coordinates workers. I think that Robert's "condition variables" will end up superseding that stuff anyway. And, I think that this v2 will bitrot fairly soon, when Heikki commits what is in effect his version of my 0002-* patch (that's unchanged, if only because it refactors some things that the parallel CREATE INDEX patch is reliant on).

So, while there are still a few loose ends with this revision (it should still certainly be considered WIP), I wanted to get a revision out quickly because V1 has been left to bitrot for too long now, and my schedule is very full for the next week, ahead of my leaving to go on vacation (which is long overdue). Hopefully, I'll be able to get out a third revision next Saturday, on top of the by-then-presumably-committed new tape batch memory patch from Heikki, just before I leave. I'd rather leave with a patch available that can be cleanly applied, to make review as easy as possible, since it wouldn't be great to have this V2 with bitrot for 10 days or more. -- Peter Geoghegan
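Concretely, the high-watermark scheme presumably amounts to something like the following split (an illustration based on the description above, not the patch's exact formula):

/*
 * Illustration of the new budgeting: the CREATE INDEX as a whole gets
 * maintenance_work_mem, and each participant's tuplesort gets a slice
 * of it, e.g.
 *
 *   maintenance_work_mem = 8GB, 8 participants (7 workers + leader)
 *     => roughly 8GB / 8 = 1GB of workMem per participant's sort
 *
 * as opposed to the v1 behavior, where every worker was budgeted the
 * full maintenance_work_mem.
 */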
On Sun, Sep 11, 2016 at 2:05 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Pushed this "displace root" patch, with some changes: > > Attached is rebased version of the entire patch series, which should > be applied on top of what you pushed to the master branch today. 0003 looks like a sensible cleanup of our #include structure regardless of anything this patch series is trying to accomplish, so I've committed it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 08/02/2016 01:18 AM, Peter Geoghegan wrote: > Tape unification > ---------------- > > Sort operations have a unique identifier, generated before any workers > are launched, using a scheme based on the leader's PID, and a unique > temp file number. This makes all on-disk state (temp files managed by > logtape.c) discoverable by the leader process. State in shared memory > is sized in proportion to the number of workers, so the only thing > about the data being sorted that gets passed around in shared memory > is a little logtape.c metadata for tapes, describing for example how > large each constituent BufFile is (a BufFile associated with one > particular worker's tapeset). > > (See below also for notes on buffile.c's role in all of this, fd.c and > resource management, etc.) > > ... > > buffile.c, and "unification" > ============================ > > There has been significant new infrastructure added to make logtape.c > aware of workers. buffile.c has in turn been taught about unification > as a first class part of the abstraction, with low-level management of > certain details occurring within fd.c. So, "tape unification" within > processes to open other backend's logical tapes to generate a unified > logical tapeset for the leader to merge is added. This is probably the > single biggest source of complexity for the patch, since I must > consider: > > * Creating a general, reusable abstraction for other possible BufFile > users (logtape.c only has to serve tuplesort.c, though). > > * Logical tape free space management. > > * Resource management, file lifetime, etc. fd.c resource management > can now close a file at xact end for temp files, while not deleting it > in the leader backend (only the "owning" worker backend deletes the > temp file it owns). > > * Crash safety (e.g., when to truncate existing temp files, and when not to).

I find this unification business really complicated. I think it'd be simpler to keep the BufFiles and LogicalTapeSets separate, and instead teach tuplesort.c how to merge tapes that live on different LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single LogicalTapeSet can contain tapes from different underlying BufFiles.

What I have in mind is something like the attached patch. It refactors the LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet doesn't have the concept of a tape number anymore, it can contain any number of tapes, and you can create more on the fly. With that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that came from different tape sets, backed by different BufFiles. I think that'd avoid much of the unification code.

That leaves one problem, though: reusing space in the final merge phase. If the tapes being merged belong to different LogicalTapeSets, and you create one new tape to hold the result, the new tape cannot easily reuse the space of the input tapes because they are on different tape sets. But looking at your patch, ISTM you actually dodged that problem as well:

> + * As a consequence of only being permitted to write to the leader > + * controlled range, parallel sorts that require a final materialized tape > + * will use approximately twice the disk space for temp files compared to > + * a more or less equivalent serial sort. This is deemed acceptable, > + * since it is far rarer in practice for parallel sort operations to > + * require a final materialized output tape.
Note that this does not > + * apply to any merge process required by workers, which may reuse space > + * eagerly, just like conventional serial external sorts, and so > + * typically, parallel sorts consume approximately the same amount of disk > + * blocks as a more or less equivalent serial sort, even when workers must > + * perform some merging to produce input to the leader.

I'm slightly worried about that. Maybe it's OK for a first version, but it'd be annoying in a query where a sort is below a merge join, for example, so that you can't do the final merge on the fly because mark/restore support is needed. One way to fix that would be to have all the parallel workers share the work files to begin with, and keep the "nFileBlocks" value in shared memory so that the workers won't overlap each other. Then all the blocks from different workers would be mixed together, though, which would hurt the sequential pattern of the tapes, so each worker would need to allocate larger chunks to avoid that. - Heikki
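The interface change Heikki describes would read roughly like this (a sketch inferred from the description rather than copied from the attached patch; names and signatures are approximate, and the attached patch is authoritative):

/*
 * Sketch of the refactored logtape.c interface: tapes become
 * first-class objects created on the fly, so a LogicalTapeSet no
 * longer hands out tape numbers, and -- the point of the exercise --
 * a merge could in principle consume LogicalTape pointers backed by
 * different BufFiles.
 */
typedef struct LogicalTape LogicalTape;         /* opaque */

extern LogicalTape *LogicalTapeCreate(LogicalTapeSet *lts);
extern void LogicalTapeWrite(LogicalTape *lt, void *ptr, size_t size);
extern void LogicalTapeRewindForRead(LogicalTape *lt);
extern size_t LogicalTapeRead(LogicalTape *lt, void *ptr, size_t size);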
On 08/02/2016 01:18 AM, Peter Geoghegan wrote: > No merging in parallel > ---------------------- > > Currently, merging worker *output* runs may only occur in the leader > process. In other words, we always keep n worker processes busy with > scanning-and-sorting (and maybe some merging), but then all processes > but the leader process grind to a halt (note that the leader process > can participate as a scan-and-sort tuplesort worker, just as it will > everywhere else, which is why I specified "parallel_workers = 7" but > talked about 8 workers). > > One leader process is kept busy with merging these n output runs on > the fly, so things will bottleneck on that, which you saw in the > example above. As already described, workers will sometimes merge in > parallel, but only their own runs -- never another worker's runs. I > did attempt to address the leader merge bottleneck by implementing > cross-worker run merging in workers. I got as far as implementing a > very rough version of this, but initial results were disappointing, > and so that was not pursued further than the experimentation stage. > > Parallel merging is a possible future improvement that could be added > to what I've come up with, but I don't think that it will move the > needle in a really noticeable way. It'd be good if you could overlap the final merges in the workers with the merge in the leader. ISTM it would be quite straightforward to replace the final tape of each worker with a shared memory queue, so that the leader could start merging and returning tuples as soon as it gets the first tuple from each worker. Instead of having to wait for all the workers to complete first. - Heikki
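Sketched in terms of the existing shm_mq infrastructure, the worker side of that idea might look as follows. This is an illustration only: next_worker_tuple() is a hypothetical stand-in for "next tuple from this worker's final merge", mqh is assumed to be a queue handle already set up via shm_mq_attach(), and a production version would batch tuples rather than send them one at a time.

/*
 * Worker: push each tuple of the final sorted run into a shared memory
 * queue, instead of materializing it on a final tape, so the leader
 * can begin merging as soon as every queue has produced one tuple.
 */
MinimalTuple tup;

while ((tup = next_worker_tuple(state)) != NULL)
{
    if (shm_mq_send(mqh, tup->t_len, tup, false) == SHM_MQ_DETACHED)
        break;                  /* leader detached; stop early */
}

/*
 * Leader: shm_mq_receive() from each worker's queue feeds the leader's
 * merge heap, replacing "read next preread tuple from that tape".
 */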
On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > It'd be good if you could overlap the final merges in the workers with the > merge in the leader. ISTM it would be quite straightforward to replace the > final tape of each worker with a shared memory queue, so that the leader > could start merging and returning tuples as soon as it gets the first tuple > from each worker. Instead of having to wait for all the workers to complete > first. If you do that, make sure to have the leader read multiple tuples at a time from each worker whenever possible. It makes a huge difference to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 21, 2016 at 5:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I find this unification business really complicated.

I can certainly understand why you would. As I said, it's the most complicated part of the patch, which overall is one of the most ambitious patches I've ever written.

> I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I think that it won't be possible to make a LogicalTapeSet ever use more than one BufFile without regressing the ability to eagerly reuse space, which is almost the entire reason for logtape.c existing. The whole indirect block thing is an idea borrowed from the FS world, of course, and so logtape.c needs one block-device-like BufFile, with blocks that can be reclaimed eagerly, but consumed for recycling in *contiguous* order (which is why they're sorted using qsort() within ltsGetFreeBlock()). You're going to increase the amount of random I/O by using more than one BufFile for an entire tapeset, I think.

This patch you posted ("0001-Refactor-LogicalTapeSet-LogicalTape-interface.patch") just keeps one BufFile, and only changes the interface to expose the tapes themselves to tuplesort.c, without actually making tuplesort.c do anything with that capability. I see what you're getting at, I think, but I don't see how that accomplishes all that much for parallel CREATE INDEX. I mean, the special case of having multiple tapesets from workers (not one "unified" tapeset created from worker temp files from their tapesets to begin with) now needs special treatment. Haven't you just moved the complexity around (once your patch is made to care about parallelism)? Having multiple entire tapesets explicitly from workers, with their own BufFiles, is not clearly less complicated than managing ranges from BufFile fd.c files with delineated ranges of "logical tapeset space". Seems almost equivalent, except that my way doesn't bother tuplesort.c with any of this.

>> + * As a consequence of only being permitted to write to the leader
>> + * controlled range, parallel sorts that require a final materialized tape
>> + * will use approximately twice the disk space for temp files compared to
>> + * a more or less equivalent serial sort.
>
> I'm slightly worried about that. Maybe it's OK for a first version, but it'd
> be annoying in a query where a sort is below a merge join, for example, so
> that you can't do the final merge on the fly because mark/restore support is
> needed.

My intuition is that we'll *never* end up using this for merge joins. I think that I could do better here (why should workers really care at this point?), but just haven't bothered to.
This parallel sort implementation is something written with CREATE INDEX and CLUSTER in mind only (maybe one or two other things, too). I believe that for query execution, partitioning is the future [1]. With merge joins, partitioning is desirable because it lets you push down *everything* to workers, not just sorting (e.g., by aligning partitioning boundaries on each side of each merge join sort in the worker, and having the worker also "synchronize" each side of the join, all independently and without a dependency on a final merge).

That's why I think it's okay that I use twice as much space for randomAccess tuplesort.c callers. No real world caller will ever end up needing to do this. It just seems like a good idea to support randomAccess when using this new infrastructure, on general principle. Forcing myself to support that case during initial development actually resulted in much cleaner, less invasive changes to tuplesort.c in general.

[1] https://www.postgresql.org/message-id/flat/CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com#CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com

--
Peter Geoghegan
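For readers without logtape.c paged in: the eager space reuse being defended here boils down to a freelist of blocks within one BufFile, consumed lowest-block-first so that reads and writes stay mostly sequential. A condensed sketch of that idea, with an invented struct name, not the actual logtape.c code:

    typedef struct
    {
        long   *freeBlocks;     /* block numbers, kept sorted in decreasing
                                 * order by qsort(), so the smallest is last */
        int     nFreeBlocks;
        long    nFileBlocks;    /* # of blocks in the underlying BufFile */
    } LogicalTapeSetSketch;

    static long
    ltsGetFreeBlockSketch(LogicalTapeSetSketch *lts)
    {
        /*
         * Recycle the lowest-numbered free block first, preserving largely
         * contiguous, ascending access; otherwise extend the BufFile.
         */
        if (lts->nFreeBlocks > 0)
            return lts->freeBlocks[--lts->nFreeBlocks];
        return lts->nFileBlocks++;
    }

With several BufFiles per tape set, there would be several independent freelists and allocation frontiers, which is the random-I/O concern being raised above.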
On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> It'd be good if you could overlap the final merges in the workers with the
>> merge in the leader. ISTM it would be quite straightforward to replace the
>> final tape of each worker with a shared memory queue, so that the leader
>> could start merging and returning tuples as soon as it gets the first tuple
>> from each worker. Instead of having to wait for all the workers to complete
>> first.
>
> If you do that, make sure to have the leader read multiple tuples at a
> time from each worker whenever possible. It makes a huge difference
> to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.

That requires some kind of mutual exclusion mechanism, like an LWLock. It's possible that merging everything lazily is actually the faster approach, given this, and given the likely bottleneck on I/O at this stage. It's also certainly simpler to not overlap things. This is something I've read about before [1], with "eager evaluation" sorting not necessarily coming out ahead IIRC.

[1] http://digitalcommons.ohsu.edu/cgi/viewcontent.cgi?article=1193&context=csetech

--
Peter Geoghegan
On Sat, Sep 24, 2016 at 9:07 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> It'd be good if you could overlap the final merges in the workers with the
>>> merge in the leader. ISTM it would be quite straightforward to replace the
>>> final tape of each worker with a shared memory queue, so that the leader
>>> could start merging and returning tuples as soon as it gets the first tuple
>>> from each worker. Instead of having to wait for all the workers to complete
>>> first.
>>
>> If you do that, make sure to have the leader read multiple tuples at a
>> time from each worker whenever possible. It makes a huge difference
>> to performance. See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.
>
> That requires some kind of mutual exclusion mechanism, like an LWLock.

No, it doesn't. Shared memory queues are single-reader, single-writer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That requires some kind of mutual exclusion mechanism, like an LWLock.
>
> No, it doesn't. Shared memory queues are single-reader, single-writer.

The point is that there is a natural dependency when merging is performed eagerly within the leader. One thing needs to be in lockstep with the others. That's all.

--
Peter Geoghegan
On Mon, Sep 26, 2016 at 3:40 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> That requires some kind of mutual exclusion mechanism, like an LWLock.
>>
>> No, it doesn't. Shared memory queues are single-reader, single-writer.
>
> The point is that there is a natural dependency when merging is
> performed eagerly within the leader. One thing needs to be in lockstep
> with the others. That's all.

I don't know what any of that means. You said we need something like an LWLock, but I think we don't. The workers just write the results of their own final merges into shm_mqs. The leader can read from any given shm_mq until no more tuples can be read without blocking, just like nodeGather.c does, or at least it can do that unless its own queue fills up first. No mutual exclusion mechanism is required for any of that, as far as I can see - not an LWLock, and not anything similar.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
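The nodeGather.c pattern being referred to, condensed from gather_readnext() and using the real TupleQueueReaderNext() interface; the reader-array bookkeeping is elided:

    /*
     * Visit each worker's single-reader, single-writer queue in turn.  No
     * lock is needed, because only the leader ever reads these queues.
     */
    for (;;)
    {
        TupleQueueReader *reader = readers[nextreader];
        bool        readerdone = false;
        HeapTuple   tup;

        tup = TupleQueueReaderNext(reader, true, &readerdone);  /* nowait */
        if (readerdone)
            ;               /* discard the exhausted reader from the array */
        else if (tup != NULL)
            return tup;     /* or: feed the tuple into a merge heap */

        nextreader = (nextreader + 1) % nreaders;
        /* if every queue would block, WaitLatch() and retry */
    }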
On Sun, Sep 11, 2016 at 11:05 AM, Peter Geoghegan <pg@heroku.com> wrote:
> So, while there are still a few loose ends with this revision (it
> should still certainly be considered WIP), I wanted to get a revision
> out quickly because V1 has been left to bitrot for too long now, and
> my schedule is very full for the next week, ahead of my leaving to go
> on vacation (which is long overdue). Hopefully, I'll be able to get
> out a third revision next Saturday, on top of the
> by-then-presumably-committed new tape batch memory patch from Heikki,
> just before I leave. I'd rather leave with a patch available that can
> be cleanly applied, to make review as easy as possible, since it
> wouldn't be great to have this V2 with bitrot for 10 days or more.

Heikki committed his preload memory patch a little later than originally expected, 4 days ago. I attach V3 of my own parallel CREATE INDEX patch, which should be applied on top of today's git master (there is a bugfix that reviewers won't want to miss -- commit b56fb691). I have my own test suite, and have to some extent used TDD for this patch, so rebasing was not so bad. My tests are rather rough and ready, so I'm not going to post them here. (Changes in the WaitLatch() API also caused bitrot, which is now fixed.)

Changes from V2:

* Since Heikki eliminated the need for any extra memtuple "slots" (memtuples is now only exactly big enough for the initial merge heap), an awful lot of code could be thrown out that managed sizing memtuples in the context of the parallel leader (based on trends seen in parallel workers). I was able to follow Heikki's example by eliminating code for parallel sorting memtuples sizing. Throwing this code out let me streamline a lot of stuff within tuplesort.c, which is cleaned up quite a bit.

* Since this revision was mostly focused on fixing up logtape.c (rebasing on top of Heikki's work), I also took the time to clarify some things about how a block-based offset might need to be applied within the leader. Essentially, outlining how and where that happens, and where it doesn't and shouldn't happen. (An offset must sometimes be applied to compensate for differences in logical BufFile positioning (leader/worker differences) following the leader's unification of worker tapesets into one big tapeset of its own.)

* max_parallel_workers_maintenance now supersedes the use of the new parallel_workers index storage parameter. This matches existing heap storage parameter behavior, and allows the implementation to add practically no cycles as compared to the master branch when the use of parallelism is disabled by setting max_parallel_workers_maintenance to 0.

* New additions to the chapter in the documentation that Robert added a little while back, "Chapter 15. Parallel Query". It's perhaps a bit of a stretch to call this feature part of parallel query, but I think that it works reasonably well. The optimizer does determine the number of workers needed here, so while it doesn't formally produce a query plan, I think the implication that it does is acceptable for user-facing documentation. (Actually, it would be nice if you really could EXPLAIN utility commands -- that would be a handy place to show information about how they were executed.) Maybe this new documentation describes things in what some would consider to be excessive detail for users. The relatively detailed information added on parallel sorting seemed to be in the pragmatic spirit of the new chapter 15, so I thought I'd see what people thought.
Work is still needed on:

* Cost model. Should probably attempt to guess final index size, and derive calculation of number of workers from that. Also, I'm concerned that I haven't given enough thought to the low end, where with default settings most CREATE INDEX statements will use at least one parallel worker.

* The whole way that I teach nbtsort.c to disallow catalog tables for parallel CREATE INDEX due to concerns about parallel safety is in need of expert review, preferably from Robert. It's complicated in a way that relies on things happening or not happening from a distance.

* Heikki seems to want to change more about logtape.c, and its use of indirection blocks. That may evolve, but for now I can only target the master branch.

* More extensive performance testing. I think that this V3 is probably the fastest version yet, what with Heikki's improvements, but I haven't really verified that.

--
Peter Geoghegan
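On the cost model item above: one plausible starting point is to scale workers logarithmically with the projected index size, mirroring what create_plain_partial_paths() already does with heap size. A sketch, where the base threshold and the function name are assumptions rather than anything from the patch:

    static int
    index_build_workers_sketch(BlockNumber est_index_pages)
    {
        BlockNumber threshold = 1024;   /* assumed ~8MB base threshold */
        int         workers;

        /* too small to be worth launching even one worker */
        if (est_index_pages < threshold)
            return 0;

        /* one extra worker each time the projected size triples */
        for (workers = 1; est_index_pages >= threshold * 3; workers++)
            threshold *= 3;

        return Min(workers, max_parallel_workers_maintenance);
    }

A model like this also gives a natural answer at the low end: anything below the base threshold gets no workers at all.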
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Work is still needed on:
>
> * Cost model. Should probably attempt to guess final index size, and
> derive calculation of number of workers from that. Also, I'm concerned
> that I haven't given enough thought to the low end, where with default
> settings most CREATE INDEX statements will use at least one parallel
> worker.
>
> * The whole way that I teach nbtsort.c to disallow catalog tables for
> parallel CREATE INDEX due to concerns about parallel safety is in need
> of expert review, preferably from Robert. It's complicated in a way
> that relies on things happening or not happening from a distance.
>
> * Heikki seems to want to change more about logtape.c, and its use of
> indirection blocks. That may evolve, but for now I can only target the
> master branch.
>
> * More extensive performance testing. I think that this V3 is probably
> the fastest version yet, what with Heikki's improvements, but I
> haven't really verified that.

I realize that you are primarily targeting utility commands here, and that is obviously great, because making index builds faster is very desirable. However, I'd just like to talk for a minute about how this relates to parallel query. With Rushabh's Gather Merge patch, you can now have a plan that looks like Gather Merge -> Sort -> whatever. That patch also allows other patterns that are useful completely independently of this patch, like Finalize GroupAggregate -> Gather Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather Merge -> Sort -> whatever path is very related to what this patch does. For example, instead of committing this patch at all, we could try to funnel index creation through the executor, building a plan of that shape, and using the results to populate the index. I'm not saying that's a good idea, but it could be done.

On the flip side, what if anything can queries hope to get out of parallel sort that they can't get out of Gather Merge? One possibility is that a parallel sort might end up being substantially faster than Gather Merge-over-non-parallel sort. In that case, we obviously should prefer it. Other possibilities seem a little obscure. For example, it's possible that you might want to have all workers participate in sorting some data and then divide the result of the sort into equal ranges that are again divided among the workers, or that you might want all workers to sort and then each worker to read a complete copy of the output data. But these don't seem like particularly mainstream needs, nor do they necessarily seem like problems that parallel sort itself should be trying to solve.

The Volcano paper[1], one of the oldest and most-cited sources I can find for research into parallel execution and with a design fairly similar to our own executor, describes various variants of what they call Exchange, of which what we now call Gather is one[2]. They describe another variant called Interchange, which acts like a Gather node without terminating parallelism: every worker process reads the complete output of an Interchange, which is the union of all rows produced by all workers running the Interchange's input plan. That seems like a better design than coupling such data flows specifically to parallel sort.

I'd like to think that parallel sort will help lots of queries, as well as helping utility commands, but I'm not sure it will. Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] "Volcano - an Extensible and Parallel Query Evaluation System", https://pdfs.semanticscholar.org/865b/5f228f08ebac0b68d3a4bfd97929ee85e4b6.pdf
[2] See "C. Variants of the Exchange Operator" on p. 13 of [1]
On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I realize that you are primarily targeting utility commands here, and
> that is obviously great, because making index builds faster is very
> desirable. However, I'd just like to talk for a minute about how this
> relates to parallel query. With Rushabh's Gather Merge patch, you can
> now have a plan that looks like Gather Merge -> Sort -> whatever.
> That patch also allows other patterns that are useful completely
> independently of this patch, like Finalize GroupAggregate -> Gather
> Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather
> Merge -> Sort -> whatever path is very related to what this patch
> does. For example, instead of committing this patch at all, we could
> try to funnel index creation through the executor, building a plan of
> that shape, and using the results to populate the index. I'm not
> saying that's a good idea, but it could be done.

Right, but that would be essentially the same approach as mine, but, I suspect, less efficient and more complicated. More importantly, it wouldn't be partitioning, and partitioning is what we really need within the executor.

> On the flip side, what if anything can queries hope to get out of
> parallel sort that they can't get out of Gather Merge? One
> possibility is that a parallel sort might end up being substantially
> faster than Gather Merge-over-non-parallel sort. In that case, we
> obviously should prefer it.

I must admit that I don't know enough about it to comment just yet. Offhand, it occurs to me that the Gather Merge sorted input could come from a number of different types of paths/nodes, whereas adopting what I've done here could only work more or less equivalently to "Gather Merge -> Sort -> Seq Scan" -- a special case, really.

> For example, it's possible that you might want to have all
> workers participate in sorting some data and then divide the result of
> the sort into equal ranges that are again divided among the workers,
> or that you might want all workers to sort and then each worker to
> read a complete copy of the output data. But these don't seem like
> particularly mainstream needs, nor do they necessarily seem like
> problems that parallel sort itself should be trying to solve.

This project of mine is about parallelizing tuplesort.c, which isn't really what you want for parallel query -- you shouldn't try to scope the problem as "make the sort more scalable using parallelism" there. Rather, you want to scope it at "make the execution of the entire query more scalable using parallelism", which is really quite a different thing, which necessarily involves the executor having direct knowledge of partition boundaries. Maybe the executor enlists tuplesort.c to help with those boundaries to some degree, but that whole thing is basically something which treats tuplesort.c as a low level primitive.

> The
> Volcano paper[1], one of the oldest and most-cited sources I can find
> for research into parallel execution and with a design fairly similar
> to our own executor, describes various variants of what they call
> Exchange, of which what we now call Gather is one.

I greatly respect the work of Goetz Graefe, including his work on the Volcano paper. Graefe has been the single biggest external influence on my work on Postgres.
> They describe
> another variant called Interchange, which acts like a Gather node
> without terminating parallelism: every worker process reads the
> complete output of an Interchange, which is the union of all rows
> produced by all workers running the Interchange's input plan. That
> seems like a better design than coupling such data flows specifically
> to parallel sort.
>
> I'd like to think that parallel sort will help lots of queries, as
> well as helping utility commands, but I'm not sure it will. Thoughts?

You are right that I'm targeting the cases where we can get real benefits without really changing the tuplesort.h contract too much. This is literally the parallel tuplesort.c patch, which probably isn't very useful for parallel query, because the final output is always consumed serially here (this doesn't matter all that much for CREATE INDEX, I believe). This approach of mine seems like the simplest way of getting a large benefit to users involving parallelizing sorting, but I certainly don't imagine it to be the be all and end all.

I have at least tried to anticipate how tuplesort.c will eventually serve the needs of partitioning for the benefit of parallel query. My intuition is that you'll have to teach it about partitioning boundaries fairly directly -- it won't do to add something generic to the executor. And, it probably won't be the only thing that needs to be taught about them.

--
Peter Geoghegan
On Thu, Oct 13, 2016 at 12:35 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On the flip side, what if anything can queries hope to get out of
>> parallel sort that they can't get out of Gather Merge? One
>> possibility is that a parallel sort might end up being substantially
>> faster than Gather Merge-over-non-parallel sort. In that case, we
>> obviously should prefer it.
>
> I must admit that I don't know enough about it to comment just yet.
> Offhand, it occurs to me that the Gather Merge sorted input could come
> from a number of different types of paths/nodes, whereas adopting what
> I've done here could only work more or less equivalently to "Gather
> Merge -> Sort -> Seq Scan" -- a special case, really.
>
>> For example, it's possible that you might want to have all
>> workers participate in sorting some data and then divide the result of
>> the sort into equal ranges that are again divided among the workers,
>> or that you might want all workers to sort and then each worker to
>> read a complete copy of the output data. But these don't seem like
>> particularly mainstream needs, nor do they necessarily seem like
>> problems that parallel sort itself should be trying to solve.
>
> This project of mine is about parallelizing tuplesort.c, which isn't
> really what you want for parallel query -- you shouldn't try to scope
> the problem as "make the sort more scalable using parallelism" there.
> Rather, you want to scope it at "make the execution of the entire
> query more scalable using parallelism", which is really quite a
> different thing, which necessarily involves the executor having direct
> knowledge of partition boundaries.

Okay, but what is the proof, or why do you think the second is going to be better than the first? One thing which strikes me as a major difference between your approach and Gather Merge is that in your approach the leader has to wait till all the workers are done with their work on sorting, whereas with Gather Merge, as soon as the first one is done, the leader starts with merging. I could be wrong here, but if I understood it correctly, then there is an argument that a Gather Merge kind of approach can win in cases where some of the workers can produce sorted outputs ahead of others, and I am not sure if we can dismiss such cases.

+struct Sharedsort
+{
..
+ * Workers increment workersFinished to indicate having finished. If
+ * this is equal to state.launched within the leader, leader is ready
+ * to merge runs.
+ *
+ * leaderDone indicates if leader is completely done (i.e., was
+ * tuplesort_end called against the state through which parallel output
+ * was consumed?)
+ */
+ int currentWorker;
+ int workersFinished;
..
}

By looking at 'workersFinished' usage, it looks like you have devised a new way for the leader to know when workers have finished, which might be required for this patch. However, have you tried to use or investigate whether existing infrastructure which serves the same purpose could be used for it?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, but what is the proof, or why do you think the second is going to
> be better than the first?

I don't have proof. It's my opinion that it probably would be, based on partial information, and my intuition. It's hard to prove something like that, because it's not really clear what that alternative would look like. Also, finding all of that out would take a long time -- it's hard to prototype. Do tuple table slots need to care about IndexTuples now? What does that even look like? What existing executor code needs to be taught about this new requirement?

> One thing which strikes me as a major difference
> between your approach and Gather Merge is that in your approach the leader
> has to wait till all the workers are done with their work on sorting,
> whereas with Gather Merge, as soon as the first one is done, the leader
> starts with merging. I could be wrong here, but if I understood it
> correctly, then there is an argument that a Gather Merge kind of approach
> can win in cases where some of the workers can produce sorted outputs
> ahead of others, and I am not sure if we can dismiss such cases.

How can it? You need to have at least one tuple from every worker (before the worker has exhausted its supply of output tuples) in order to merge and return the next tuple to the top level consumer (the thing above the Gather Merge). If you're talking about "eager vs. lazy merging", please see my previous remarks on that, on this thread. (In any case, whether we merge more eagerly seems like an orthogonal question to the one you ask.)

The first thing to note about my approach is that I openly acknowledge that this parallel CREATE INDEX patch is not much use for parallel query. I have only generalized tuplesort.c to support parallelizing a sort operation. I think that parallel query needs partitioning to push down parts of a sort to workers, with little or no need for them to be funneled together at the end, since most tuples are eliminated before being passed to the Gather/Gather Merge node. The partitioning part is really hard.

I guess that Gather Merge nodes have value because they allow us to preserve the sorted-ness of a parallel path, which might be most useful because it enables things elsewhere. But, that doesn't really recommend making Gather Merge nodes good at batch processing a large number of tuples, I suspect. (One problem with the tuple queue mechanism is that it can be a big bottleneck -- that's why we want to eliminate most tuples before they're passed up to the leader, in the case of parallel sequential scan in 9.6.)

I read the following paragraph from the Volcano paper just now:

"""
During implementation and benchmarking of parallel sorting, we added two more features to exchange. First, we wanted to implement a merge network in which some processors produce sorted streams merge concurrently by other processors. Volcano's sort iterator can be used to generate a sorted stream. A merge iterator was easily derived from the sort module. It uses a single level merge, instead of the cascaded merge of runs used in sort. The input of a merge iterator is an exchange. Differently from other operators, the merge iterator requires to distinguish the input records by their producer. As an example, for a join operation it does not matter where the input records were created, and all inputs can be accumulated in a single input stream. For a merge operation, it is crucial to distinguish the input records by their producer in order to merge multiple sorted streams correctly.
"""

I don't really understand this paragraph, but thought I'd ask: why the need to "distinguish the input records by their producer in order to merge multiple sorted streams correctly"? Isn't that talking about partitioning, where each worker's *ownership* of a range matters? My patch doesn't care which values belong to which workers. And, it focuses quite a lot on dealing well with the memory bandwidth bound, I/O bound part of the sort where we write out the index itself, just by piggy-backing on tuplesort.c. I don't think that that's useful for a general-purpose executor node -- tuple-at-a-time processing when fetching from workers would kill performance.

> By looking at 'workersFinished' usage, it looks like you have devised
> a new way for the leader to know when workers have finished, which might be
> required for this patch. However, have you tried to use or investigate
> whether existing infrastructure which serves the same purpose could
> be used for it?

Yes, I have. I think that Robert's "condition variables" patch would offer a general solution to what I've devised. What I have there is, as you say, fairly ad-hoc, even though my requirements are actually fairly general. I was actually annoyed that there wasn't an easier way to do that myself. Robert has said that he won't commit his "condition variables" work until it's clear that there will be some use for the facility. Well, I'd use it for this patch, if I could. Robert?

--
Peter Geoghegan
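To sketch how the ad-hoc workersFinished handshake might map onto the proposed condition variables facility (the cv field, the helper test, and the wait event name are all assumptions, since that patch was unreleased at the time):

    /* Worker, on finishing its sort: announce completion, wake the leader. */
    SpinLockAcquire(&shared->mutex);
    shared->workersFinished++;
    SpinLockRelease(&shared->mutex);
    ConditionVariableBroadcast(&shared->cv);

    /* Leader: sleep until every launched worker has checked in. */
    ConditionVariablePrepareToSleep(&shared->cv);
    while (!all_workers_finished(shared, nlaunched))    /* hypothetical test */
        ConditionVariableSleep(&shared->cv, WAIT_EVENT_PARALLEL_SORT); /* assumed event name */
    ConditionVariableCancelSleep();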
On Mon, Oct 17, 2016 at 8:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> This project of mine is about parallelizing tuplesort.c, which isn't
>> really what you want for parallel query -- you shouldn't try to scope
>> the problem as "make the sort more scalable using parallelism" there.
>> Rather, you want to scope it at "make the execution of the entire
>> query more scalable using parallelism", which is really quite a
>> different thing, which necessarily involves the executor having direct
>> knowledge of partition boundaries.
>
> Okay, but what is the proof, or why do you think the second is going to
> be better than the first? One thing which strikes me as a major difference
> between your approach and Gather Merge is that in your approach the leader
> has to wait till all the workers are done with their work on sorting,
> whereas with Gather Merge, as soon as the first one is done, the leader
> starts with merging. I could be wrong here, but if I understood it
> correctly, then there is an argument that a Gather Merge kind of approach
> can win in cases where some of the workers can produce sorted outputs
> ahead of others, and I am not sure if we can dismiss such cases.

Gather Merge can't emit a tuple unless it has buffered at least one tuple from every producer; otherwise, the next tuple it receives from one of those producers might precede whichever tuple it chooses to emit. However, it doesn't need to wait until all of the workers are completely done. The leader only needs to be at least slightly ahead of the slowest worker. I'm not sure how that compares to Peter's approach.

What I'm worried about is that we're implementing two separate systems to do the same thing, and that the parallel sort approach is actually a lot less general. I think it's possible to imagine a Parallel Sort implementation which does things Gather Merge can't. If all of the workers collaborate to sort all of the data rather than each worker sorting its own data, then you've got something which Gather Merge can't match. But this is not that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Gather Merge can't emit a tuple unless it has buffered at least one
> tuple from every producer; otherwise, the next tuple it receives from
> one of those producers might precede whichever tuple it chooses to
> emit. However, it doesn't need to wait until all of the workers are
> completely done. The leader only needs to be at least slightly ahead
> of the slowest worker. I'm not sure how that compares to Peter's
> approach.

I don't think that eager merging will prove all that effective, however it's implemented. I see a very I/O bound system when parallel CREATE INDEX merges serially. There is no obvious reason why you'd have a straggler worker process with CREATE INDEX, really.

> What I'm worried about is that we're implementing two separate systems
> to do the same thing, and that the parallel sort approach is actually
> a lot less general. I think it's possible to imagine a Parallel Sort
> implementation which does things Gather Merge can't. If all of the
> workers collaborate to sort all of the data rather than each worker
> sorting its own data, then you've got something which Gather Merge
> can't match. But this is not that.

It's not that yet, certainly. I think I've sketched a path forward for making partitioning a part of logtape.c that is promising. The sharing of ranges within tapes and so on will probably have a significant amount in common with what I've come up with.

I don't think that any executor infrastructure is a particularly good model when *batch output* is needed -- the tuple queue mechanism will be a significant bottleneck, particularly because it does not integrate read-ahead, etc.

The best case that I saw advertised for Gather Merge was TPC-H query 9 [1]. That doesn't look like a good proxy for how Gather Merge adapted to parallel CREATE INDEX would do, since it benefits from the GroupAggregate merge having many equal values, possibly with a clustering in the original tables that can naturally be exploited (no TID tiebreaker needed, since IndexTuples are not being merged). Also, it looks like Gather Merge may do that well by enabling things, rather than parallelizing the sort effectively per se. Besides, the query 9 case is significantly less scalable than good cases for this parallel CREATE INDEX patch have already been shown to be.

I think I've been pretty modest about what this parallel CREATE INDEX patch gets us from the beginning. It is a generalization of tuplesort.c to work in parallel; we need a lot more for that to make things like GroupAggregate as scalable as possible, and I don't pretend that this helps much with that. There are actually more changes to nbtsort.c to coordinate all of this than there are to tuplesort.c in the latest version, so I think that this simpler approach for parallel CREATE INDEX and CLUSTER is worthwhile.

The bottom line is that it's inherently difficult for me to refute the idea that Gather Merge could do just as well as what I have here, because proving that involves adding a significant amount of new infrastructure (e.g., to teach the executor about IndexTuples). I think that the argument for this basic approach is sound (it appears to offer comparable scalability to the parallel CREATE INDEX implementations of other systems), but it's simply impractical for me to offer much reassurance beyond that.

[1] https://github.com/tvondra/pg_tpch/blob/master/dss/templates/9.sql

--
Peter Geoghegan
On Thu, Oct 20, 2016 at 12:03 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Gather Merge can't emit a tuple unless it has buffered at least one
>> tuple from every producer; otherwise, the next tuple it receives from
>> one of those producers might precede whichever tuple it chooses to
>> emit.

Right. Now, after again looking at the Gather Merge patch, I think I can better understand how it performs merging.

>> However, it doesn't need to wait until all of the workers are
>> completely done. The leader only needs to be at least slightly ahead
>> of the slowest worker. I'm not sure how that compares to Peter's
>> approach.
>
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.
>
>> What I'm worried about is that we're implementing two separate systems
>> to do the same thing, and that the parallel sort approach is actually
>> a lot less general. I think it's possible to imagine a Parallel Sort
>> implementation which does things Gather Merge can't. If all of the
>> workers collaborate to sort all of the data rather than each worker
>> sorting its own data, then you've got something which Gather Merge
>> can't match. But this is not that.
>
> It's not that yet, certainly. I think I've sketched a path forward for
> making partitioning a part of logtape.c that is promising. The sharing
> of ranges within tapes and so on will probably have a significant
> amount in common with what I've come up with.
>
> I don't think that any executor infrastructure is a particularly good
> model when *batch output* is needed -- the tuple queue mechanism will
> be a significant bottleneck, particularly because it does not
> integrate read-ahead, etc.

The tuple queue mechanism might not be super-efficient for *batch output* (cases where many tuples need to be read and written), but I see no reason why it will be slower than the disk I/O which I think you are using in the patch. IIUC, in the patch each worker including the leader does a tape sort for its share of tuples, and then finally the leader merges and populates the index. I am not sure if the mechanism used in the patch can be more useful as compared to using a tuple queue, if the workers can finish their part of sorting in-memory.

> The bottom line is that it's inherently difficult for me to refute the
> idea that Gather Merge could do just as well as what I have here,
> because proving that involves adding a significant amount of new
> infrastructure (e.g., to teach the executor about IndexTuples).

I think there could be a simpler way: like, we can force the Gather Merge node when all the tuples need to be sorted, and compute the time till it merges all tuples. Similarly, with your patch, we can wait till the final merge is completed. However, after doing an initial study of both the patches, I feel one can construct cases where Gather Merge can win, and also there will be cases where your patch can win. In particular, Gather Merge can win where workers need to perform the sort mostly in-memory. I am not sure if it's easy to get the best of both worlds.

Your patch needs a rebase, and I noticed one warning.
sort\logtape.c(1422): warning C4700: uninitialized local variable 'lt' used

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I read the following paragraph from the Volcano paper just now:
>
> """
> During implementation and benchmarking of parallel sorting, we added
> two more features to exchange. First, we wanted to implement a merge
> network in which some processors produce sorted streams merge
> concurrently by other processors. Volcano's sort iterator can be used
> to generate a sorted stream. A merge iterator was easily derived from
> the sort module. It uses a single level merge, instead of the cascaded
> merge of runs used in sort. The input of a merge iterator is an
> exchange. Differently from other operators, the merge iterator
> requires to distinguish the input records by their producer. As an
> example, for a join operation it does not matter where the input
> records were created, and all inputs can be accumulated in a single
> input stream. For a merge operation, it is crucial to distinguish the
> input records by their producer in order to merge multiple sorted
> streams correctly.
> """
>
> I don't really understand this paragraph, but thought I'd ask: why the
> need to "distinguish the input records by their producer in order to
> merge multiple sorted streams correctly"? Isn't that talking about
> partitioning, where each worker's *ownership* of a range matters?

I think so, but it seems from the above text that it is mainly required for the merge iterator, which probably will be used in a merge join.

> My
> patch doesn't care which values belong to which workers. And, it
> focuses quite a lot on dealing well with the memory bandwidth bound,
> I/O bound part of the sort where we write out the index itself, just
> by piggy-backing on tuplesort.c. I don't think that that's useful for
> a general-purpose executor node -- tuple-at-a-time processing when
> fetching from workers would kill performance.

Right, but what is written in the text quoted by you seems to be do-able with tuple-at-a-time processing.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 21, 2016 at 4:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I read the following paragraph from the Volcano paper just now:
>>
>> """
>> During implementation and benchmarking of parallel sorting, we added
>> two more features to exchange. First, we wanted to implement a merge
>> network in which some processors produce sorted streams merge
>> concurrently by other processors. Volcano's sort iterator can be used
>> to generate a sorted stream. A merge iterator was easily derived from
>> the sort module. It uses a single level merge, instead of the cascaded
>> merge of runs used in sort. The input of a merge iterator is an
>> exchange. Differently from other operators, the merge iterator
>> requires to distinguish the input records by their producer. As an
>> example, for a join operation it does not matter where the input
>> records were created, and all inputs can be accumulated in a single
>> input stream. For a merge operation, it is crucial to distinguish the
>> input records by their producer in order to merge multiple sorted
>> streams correctly.
>> """
>>
>> I don't really understand this paragraph, but thought I'd ask: why the
>> need to "distinguish the input records by their producer in order to
>> merge multiple sorted streams correctly"? Isn't that talking about
>> partitioning, where each worker's *ownership* of a range matters?
>
> I think so, but it seems from the above text that it is mainly required for
> the merge iterator, which probably will be used in a merge join.
>
>> My
>> patch doesn't care which values belong to which workers. And, it
>> focuses quite a lot on dealing well with the memory bandwidth bound,
>> I/O bound part of the sort where we write out the index itself, just
>> by piggy-backing on tuplesort.c. I don't think that that's useful for
>> a general-purpose executor node -- tuple-at-a-time processing when
>> fetching from workers would kill performance.
>
> Right, but what is written in the text quoted by you seems to be do-able
> with tuple-at-a-time processing.

To be clear, by saying the above, I don't mean that we should try that approach instead of what you are proposing, but it is worth some discussion to see if it has any significant merits.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Work is still needed on:
>
> * Cost model. Should probably attempt to guess final index size, and
> derive calculation of number of workers from that. Also, I'm concerned
> that I haven't given enough thought to the low end, where with default
> settings most CREATE INDEX statements will use at least one parallel
> worker.
>
> * The whole way that I teach nbtsort.c to disallow catalog tables for
> parallel CREATE INDEX due to concerns about parallel safety is in need
> of expert review, preferably from Robert. It's complicated in a way
> that relies on things happening or not happening from a distance.
>
> * Heikki seems to want to change more about logtape.c, and its use of
> indirection blocks. That may evolve, but for now I can only target the
> master branch.
>
> * More extensive performance testing. I think that this V3 is probably
> the fastest version yet, what with Heikki's improvements, but I
> haven't really verified that.

While I haven't made progress on any of these open items, I should still get a version out that applies cleanly on top of git tip -- commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to bitrot. I attach V4, which is a fairly mechanical rebase of V3, with no notable behavioral changes or bug fixes.

--
Peter Geoghegan
On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Setup:
>
> CREATE TABLE parallel_sort_test AS
> SELECT hashint8(i) randint,
> md5(i::text) collate "C" padding1,
> md5(i::text || '2') collate "C" padding2
> FROM generate_series(0, 1e9::bigint) i;
>
> CHECKPOINT;
>
> This leaves us with a parallel_sort_test table that is 94 GB in size.
>
> SET maintenance_work_mem = '8GB';
>
> -- Serial case (external sort, should closely match master branch):
> CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 0);
>
> Total time: 00:15:42.15
>
> -- Patch with 8 tuplesort "sort-and-scan" workers (leader process
> participates as a worker here):
> CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 7);
>
> Total time: 00:06:03.86
>
> As you can see, the parallel case is 2.58x faster

I decided to revisit this exact benchmark, using the same AWS instance type (the one with 16 HDDs, again configured in software RAID0) to see how things had changed for both parallel and serial cases. I am now testing V4. A lot changed in the last 3 months, with most of the changes that help here now already committed to the master branch.

Relevant changes
================

* Heikki's major overhaul of preload memory makes CREATE INDEX merging have more sequential access patterns. It also effectively allows us to use more memory. It's possible that the biggest benefit it brings to parallel CREATE INDEX is that it eliminates almost any random I/O penalty from logtape.c fragmentation that an extra merge pass has; parallel workers now usually do their own merge to produce one big run for the leader to merge. It also improves CPU cache efficiency quite directly, I think. This is the patch that helps most. Many thanks to Heikki for driving this forward.

* My patch to simplify and optimize how the K-way merge heap is maintained (as tuples fill leaf pages of the final index structure) makes the merge phase significantly less CPU bound overall.

(These first two items particularly help parallel CREATE INDEX, which spends proportionally much more wall clock time merging than would be expected for similar serial cases. Serial cases do of course also benefit.)

* V2 of the patch (and all subsequent versions) apportioned slices of maintenance_work_mem to workers. maintenance_work_mem became a per-utility-operation budget, regardless of number of workers launched. This means that workers have less memory than in the original V1 benchmark (they simply don't make use of it now), but this seems unlikely to hurt. Possibly, it even helps.

* Andres' work on md.c scalability may have helped (seems unlikely with these CREATE INDEX cases that produce indexes not in the hundreds of gigabytes, though). It would help with *extremely* large index creation, which we won't really look at here.

Things now look better than ever for the parallel CREATE INDEX patch. While it's typical for about 75% of wall clock time to be spent on sorting runs with serial CREATE INDEX, with the remaining 25% going on merging/writing the index, with parallel CREATE INDEX I now generally see about a 50/50 split between parallel sorting of runs (including any worker merging to produce final runs) and serial merging for the final on-the-fly merge, where we actually write the new index out as input is merged. This is a *significant* improvement over what we saw here back in August, where it was not uncommon for parallel CREATE INDEX to spend *twice* as much time in the serial final on-the-fly merge step. All improvements to the code that we've seen since August have targeted this final on-the-fly merge bottleneck. (The final on-the-fly merge is now *consistently* able to write out the index at a rate of 150MB/sec+ in my tests, which is pretty good.)

New results
===========

Same setup as the one quoted above -- once again, we "SET maintenance_work_mem = '8GB'".

-- Patch with 8 tuplesort "sort-and-scan" workers:
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH (parallel_workers = 7);

Total time: 00:04:24.93

-- Serial case:
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH (parallel_workers = 0);

Total time: 00:14:25.19

3.27x faster. Not bad. As you see in the quoted text, that was 2.58x back in August, even though the implementation now uses a lot less memory in parallel workers. And, that's without even considering the general question of how much faster index creation can be compared to Postgres 9.6 -- it's roughly 3.5x faster at times.

New case
========

Separately, using my gensort tool [1], I came up with a new test case. The tool generated a 2.5 billion row table, sized at 159GB. This is how long it takes to produce a 73GB index on the "sortkey" column of the resulting table:

-- gensort "C" locale text parallel case:
CREATE INDEX test8 on sort_test(sortkey) WITH (parallel_workers = 7);

Total time: 00:16:19.63

-- gensort "C" locale text serial case:
CREATE INDEX test0 on sort_test(sortkey) WITH (parallel_workers = 0);

Total time: 00:45:56.96

That's a 2.81x improvement in creation time relative to the serial case. Not quite as big a difference as seen in the first case, but remember that this is just like the cases that were only made something like 2x - 2.2x faster by the use of parallelism back in August (see the full e-mail quoted above [2]). These are cases involving a text column, or maybe a numeric column, that have complex comparators used during merging that must handle detoasting, possibly even allocate memory, etc. This second result is therefore probably the more significant of the two results shown, since it now seems like we're more consistently close to the ~3x improvement that other major database systems also seem to top out at as parallel CREATE INDEX workers are added.

(I still can't see any benefit with 16 workers; my guess is that the anti-scaling begins even before the merge starts when using that many workers. That guess is hard to verify, given the confounding factor of more workers producing more runs, leaving more work for the serial merge phase.)

I'd still welcome benchmarking or performance validation from somebody else.

[1] https://github.com/petergeoghegan/gensort
[2] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com

--
Peter Geoghegan
On Wed, Oct 19, 2016 at 11:33 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.

In an effort to head off any misunderstanding around this patch series, I started a new Wiki page for it:

https://wiki.postgresql.org/wiki/Parallel_External_Sort

This talks about parallel CREATE INDEX in particular, and uses of parallel external sort more generally, including future uses beyond CREATE INDEX. This approach worked very well for me during the UPSERT project, where a detailed overview really helped. With UPSERT, it was particularly difficult to keep the *current* state of things straight, such as current open items for the patch, areas of disagreement, and areas where there was no longer any disagreement or controversy.

I don't think that this patch is even remotely as complicated as UPSERT was, but it's still something that has had several concurrently active mailing list threads (threads that are at least loosely related to the project), so I think that this will be useful. I welcome anyone with an interest in this project to review the Wiki page, add their own concerns to it with -hackers citation, and add their own content around related work. There is a kind of unresolved question around where the Gather Merge work might fit in to what I've come up with already. There may be other unresolved questions like that, that I'm not even aware of.

I commit to maintaining the new Wiki page as a useful starting reference for understanding the current state of this patch. I hope this makes looking into the patch series less intimidating for potential reviewers.

--
Peter Geoghegan
On Mon, Oct 24, 2016 at 6:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> * Cost model. Should probably attempt to guess final index size, and
>> derive calculation of number of workers from that. Also, I'm concerned
>> that I haven't given enough thought to the low end, where with default
>> settings most CREATE INDEX statements will use at least one parallel
>> worker.
>
> While I haven't made progress on any of these open items, I should
> still get a version out that applies cleanly on top of git tip --
> commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to
> bitrot. I attach V4, which is a fairly mechanical rebase of V3, with
> no notable behavioral changes or bug fixes.

I attach V5. Changes:

* A big cost model overhaul. Workers are logarithmically scaled based on projected final *index* size, not current heap size, as was the case in V4. A new nbtpage.c routine is added to estimate a not-yet-built B-Tree index's size, now called by the optimizer. This involves getting the average item width for indexed attributes from pg_attribute for the heap relation. There are some subtleties here with partial indexes, null_frac, etc. I also refined the cap applied on the number of workers that limits too many workers being launched when there isn't so much maintenance_work_mem.

The cost model is much improved now -- it is now more than just a placeholder, at least. It doesn't do things like launch a totally inappropriate number of workers to build a very small partial index. Granted, those workers would still have something to do -- scan the heap -- but not enough to justify launching so many (that is, launching as many as would be launched for an equivalent non-partial index). That having been said, things are still quite fudged here, and I struggle to find any guiding principle around doing better on average. I think that that's because of the inherent difficulty of modeling what's going on, but I'd be happy to be proven wrong on that. In any case, I think it's going to be fairly common for DBAs to want to use the storage parameter to force the use of a particular number of parallel workers. (See also: my remarks below on how the new bt_estimate_nblocks() SQL-callable function can give insight into the new cost model's decisions.)

* Overhauled leader_mergeruns() further, to make it closer to mergeruns(). We now always rewind input tapes. This simplification involved refining some of the assertions within logtape.c, which is also slightly simplified.

* 2 new testing tools are added to the final commit in the patch series (not actually proposed for commit). I've added 2 new SQL-callable functions to contrib/pageinspect. The 2 new testing functions are:

bt_estimate_nblocks
-------------------

bt_estimate_nblocks() provides an easy way to see the optimizer's projection of how large the final index will be. It returns an estimate in blocks.
Example:

mgd=# analyze;
ANALYZE
mgd=# select oid::regclass as rel, bt_estimated_nblocks(oid), relpages,
      to_char(bt_estimated_nblocks(oid)::numeric / relpages, 'FM990.990') as estimate_actual
      from pg_class where relkind = 'i'
      order by relpages desc limit 20;

                        rel                         │ bt_estimated_nblocks │ relpages │ estimate_actual
────────────────────────────────────────────────────┼──────────────────────┼──────────┼─────────────────
 mgd.acc_accession_idx_accid                        │              107,091 │  106,274 │ 1.008
 mgd.acc_accession_0                                │              169,024 │  106,274 │ 1.590
 mgd.acc_accession_1                                │              169,024 │   80,382 │ 2.103
 mgd.acc_accession_idx_prefixpart                   │               76,661 │   80,382 │ 0.954
 mgd.acc_accession_idx_mgitype_key                  │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_clustered                    │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_createdby_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_numericpart                  │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_logicaldb_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_modifiedby_key               │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_pkey                             │               76,661 │   76,928 │ 0.997
 mgd.mgi_relationship_property_idx_propertyname_key │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_modifiedby_key   │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_pkey                 │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_clustered        │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_createdby_key    │               74,197 │   74,462 │ 0.996
 mgd.seq_sequence_idx_clustered                     │               50,051 │   50,486 │ 0.991
 mgd.seq_sequence_raw_pkey                          │               35,826 │   35,952 │ 0.996
 mgd.seq_sequence_raw_idx_modifiedby_key            │               35,826 │   35,952 │ 0.996
 mgd.seq_source_assoc_idx_clustered                 │               35,822 │   35,952 │ 0.996
(20 rows)

I haven't tried to make the underlying logic as close to perfect as possible, but it tends to be accurate in practice, as is evident from this real-world example (this shows larger indexes following a restoration of the mouse genome sample database [1]).

Perhaps there could be a role for a refined bt_estimate_nblocks() function in determining when B-Tree indexes become bloated/unbalanced (maybe have pgstatindex() estimate index bloat based on a difference between projected and actual fan-in?). That has nothing to do with parallel CREATE INDEX, though.

bt_main_forks_identical
-----------------------

bt_main_forks_identical() checks if 2 specific relations have bitwise identical main forks. If they do, it returns the number of blocks in the main fork of each. Otherwise, an error is raised.

Unlike any approach involving *writing* the index in parallel (e.g., any worthwhile approach based on data partitioning), the proposed parallel CREATE INDEX implementation creates an identical index representation to that created by any serial process (including, for example, the master branch when CREATE INDEX uses an internal sort). The index that you end up with when parallelism is used ought to be 100% identical in all cases. (This is true because there is a TID tie-breaker when sorting B-Tree index tuples, and because LSNs are set to 0 by CREATE INDEX. Why not exploit that fact to test the implementation?)

If anyone can demonstrate that parallel CREATE INDEX fails to create a bitwise-identical index representation to a "known good" implementation, or can demonstrate that it doesn't consistently produce exactly the same final index representation given the same underlying table as input, then there *must* be a bug. bt_main_forks_identical() gives reviewers an easy way to verify this, perhaps just in passing during benchmarking.
pg_restore
==========

It occurs to me that parallel CREATE INDEX receives no special consideration by pg_restore. This leaves it so that the use of parallel CREATE INDEX can come down to whether or not pg_class.reltuples is accidentally updated by something like an initial CREATE INDEX. This is not ideal.

There is also the question of how pg_restore -j cases ought to give special consideration to parallel CREATE INDEX, if at all -- it's probably true that concurrent index builds on the same relation do go together well with parallel CREATE INDEX, but even in V5 pg_restore remains totally naive about this. That having been said, pg_restore currently does nothing clever with maintenance_work_mem when multiple jobs are used, even though that seems at least as useful as what I outline for parallel CREATE INDEX. It's not clear how to judge this.

What do we need to teach pg_restore about parallel CREATE INDEX, if anything at all? Could this be as simple as a blanket disabling of parallelism for CREATE INDEX from pg_restore? Or does it need to be more sophisticated than that? I suppose that tools like reindexdb and pgbench must be considered in a similar way.

Maybe we could get the number of blocks in the heap relation from the smgr when its pg_class.reltuples is 0, and then extrapolate reltuples using simple, generic logic, in the style of vac_estimate_reltuples() (its "old_rel_pages" == 0 case). For now, I've avoided doing that out of concern for the overhead in cases where there are many small tables to be restored, and because it may be better to err on the side of not using parallelism.

[1] https://wiki.postgresql.org/wiki/Sample_Databases
--
Peter Geoghegan
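As a rough illustration of the vac_estimate_reltuples()-style extrapolation mentioned just above: guess a tuple density from the average tuple width, then scale by the block count taken from the smgr. Every constant in this sketch is an assumption for illustration, not PostgreSQL's actual accounting:

/*
 * Sketch: extrapolate reltuples from the relation's block count when no
 * statistics exist at all (the "old_rel_pages == 0" style of estimate).
 */
static double
sketch_estimate_reltuples(long heap_blocks, int avg_tuple_width)
{
    const double usable_bytes_per_page = 8192 * 0.90;   /* assume ~10% page overhead */
    const int    per_tuple_overhead = 28;               /* assumed per-tuple header cost */
    double       tuples_per_page;

    if (heap_blocks <= 0)
        return 0.0;

    tuples_per_page = usable_bytes_per_page /
        (avg_tuple_width + per_tuple_overhead);

    return heap_blocks * tuples_per_page;
}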
On Mon, Nov 7, 2016 at 11:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I attach V5.

I gather that 0001, which puts a cap on the number of tapes, is not actually related to the subject of this thread; it's an independent change that you think is a good idea. I reviewed the previous discussion on this topic upthread, between you and Heikki, which seems to me to contain more heat than light. At least in my opinion, the question is not whether a limit on the number of tapes is the best possible system, but rather whether it's better than the status quo. It's silly to refuse to make a simple change on the grounds that some much more complex change might be better, because if somebody writes that patch and it is better, we can always revert 0001 then. If 0001 involved hundreds of lines of invasive code changes, that argument wouldn't apply, but it doesn't; it's almost a one-liner.

Now, on the other hand, as far as I can see, the actual amount of evidence that 0001 is a good idea which has been presented in this forum is pretty near zero. You've argued for it on theoretical grounds several times, but theoretical arguments are not a substitute for test results. Therefore, I decided that the best thing to do was test it myself. I wrote a little patch to add a GUC for max_sort_tapes, which actually turns out not to work as I thought: setting max_sort_tapes = 501 seems to limit the highest tape number to 501 rather than the number of tapes to 501, so there's a sort of off-by-one error. But that doesn't really matter. The patch is attached here for the convenience of anyone else who may want to fiddle with this.

Next, I tried to set things up so that I'd get a large enough number of tapes for the cap to matter. To do that, I initialized with "pgbench -i --unlogged-tables -s 20000" so that I had 2 billion tuples. Then I used this SQL query: "select sum(w+abalance) from (select (aid::numeric * 7123000217)%1000000000 w, * from pgbench_accounts order by 1) x". The point of the math is to perturb the ordering of the tuples so that they actually need to be sorted instead of just passed through unchanged. The use of abalance in the outer sum prevents an index-only scan from being used, which makes the sort wider; perhaps I should have tried to make it wider still, but this is what I did. I wanted to have more than 501 tapes because, obviously, a concern with a change like this is that things might get slower in the case where it forces a polyphase merge rather than a single merge pass. And, of course, I set trace_sort = on.

Here's what my initial run looked like, in brief:

2016-11-09 15:37:52 UTC [44026] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 15:37:59 UTC [44026] LOG: switching to external sort with 937 tapes: CPU: user: 5.51 s, system: 0.27 s, elapsed: 6.56 s
2016-11-09 16:48:31 UTC [44026] LOG: finished writing run 616 to tape 615: CPU: user: 4029.17 s, system: 152.72 s, elapsed: 4238.54 s
2016-11-09 16:48:31 UTC [44026] LOG: using 246719 KB of memory for read buffers among 616 input tapes
2016-11-09 16:48:39 UTC [44026] LOG: performsort done (except 616-way final merge): CPU: user: 4030.30 s, system: 152.98 s, elapsed: 4247.41 s
2016-11-09 18:33:30 UTC [44026] LOG: external sort ended, 6255145 disk blocks used: CPU: user: 10214.64 s, system: 175.24 s, elapsed: 10538.06 s

And according to psql: Time: 10538068.225 ms (02:55:38.068)

Then I set max_sort_tapes = 501 and ran it again.
This time:

2016-11-09 19:05:22 UTC [44026] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 19:05:28 UTC [44026] LOG: switching to external sort with 502 tapes: CPU: user: 5.69 s, system: 0.26 s, elapsed: 6.13 s
2016-11-09 20:15:20 UTC [44026] LOG: finished writing run 577 to tape 75: CPU: user: 3993.81 s, system: 153.42 s, elapsed: 4198.52 s
2016-11-09 20:15:20 UTC [44026] LOG: using 249594 KB of memory for read buffers among 501 input tapes
2016-11-09 20:21:19 UTC [44026] LOG: finished 77-way merge step: CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 20:21:19 UTC [44026] LOG: performsort done (except 501-way final merge): CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 21:38:12 UTC [44026] LOG: external sort ended, 6255484 disk blocks used: CPU: user: 8848.81 s, system: 182.64 s, elapsed: 9170.62 s

And this one, according to psql: Time: 9170629.597 ms (02:32:50.630)

That looks very good. On a test that runs for almost 3 hours, we saved more than 20 minutes. The overall runtime improvement is 23% in a case where we would not expect this patch to do particularly well; after all, without limiting the number of tapes, we are able to complete the sort with a single merge pass, whereas when we reduce the number of tapes, we now require a polyphase merge. Nevertheless, we come out way ahead, because the final merge pass gets way faster, presumably because there are fewer tapes involved. The first test does a 616-way final merge and takes 6184.34 seconds to do it. The second test does a 501-way final merge and takes 4519.31 seconds to do it. This increased final merge speed accounts for practically all of the speedup, and the reason it's faster pretty much has to be that it's merging fewer tapes.

That, in turn, happens for two reasons. First, because limiting the number of tapes slightly increases the memory available for storing the tuples belonging to each run, we end up with fewer runs in the first place. The number of runs drops from 616 to 577, about a 7% reduction. Second, because we have more runs than tapes in the second case, it does a 77-way merge prior to the final merge. Because of that 77-way merge, the time at which the second test starts producing tuples is slightly later. Instead of producing the first tuple at 70:47.71, we have to wait until 75:72.22.

That's a small disadvantage in this case, because it's hypothetically possible that a query like this could have a LIMIT and we'd end up worse off overall. However, that's pretty unlikely, for three reasons. Number one, LIMIT isn't likely to be used on queries of this type in the first place. Number two, if it were used, we'd probably end up with a bounded sort plan which would be way faster anyway. Number three, if somehow we still sorted the data set, we'd still win in this case if the limit were more than about 20% of the total number of tuples. Possibly needing to wait a little longer for the first tuple is a small price to pay for the much faster run time to produce the whole data set.

Admittedly, this is only one test, and some other test might show a different result. However, I believe that there aren't likely to be many losing cases. If the reduced number of tapes doesn't force a polyphase merge, we're almost certain to win, because in that case the only thing that changes is that we have more memory with which to produce each run. On small sorts, this may not help much, but it won't hurt.
Even if the reduced number of tapes *does* force a polyphase merge, the reduction in the number of initial runs and/or the reduction in the number of runs in any single merge may add up to a win, as in this example. In fact, it may well be the case that the optimal number of tapes is significantly less than 501. It's hard to tell for sure, but it sure looks like that 77-way non-final merge is significantly more efficient than the final merge.

So, I'm now feeling pretty bullish about this patch, except for one thing, which is that I think the comments are way off-base. Peter writes: "When allowedMem is significantly lower than what is required for an internal sort, it is unlikely that there are benefits to increasing the number of tapes beyond Knuth's 'sweet spot' of 7." I'm pretty sure that's totally wrong, first of all because commit df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing precisely the thing which this comment says we shouldn't, secondly because 501 is most definitely significantly higher than 7, so the code and the comment don't even match, and thirdly because, as the comment added in that commit says, each extra tape doesn't really cost that much. In this example, going from 501 tapes up to 937 tapes only reduces the memory available for tuples by about 7%, even though the number of tapes has almost doubled. If we had a sort with, say, 30 runs, do we really want to do a polyphase merge just to get a sub-1% increase in the amount of memory per run? I doubt it.

Given all that, what I'm inclined to do is rewrite the comment to say, basically, that even though we can afford lots of tapes, it's better not to allow ridiculously many, because (1) that eats away at the amount of memory available for tuples in each initial run and (2) very high-order final merges are not very efficient. And then commit that. If somebody wants to fine-tune the tape limit later after more extensive testing, or replace it with some other system that is better, great.

Sound OK?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
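To see why the 7% figure above comes out the way it does, here is a back-of-the-envelope sketch. The fixed per-tape reservation below is inferred from the numbers quoted in this thread, not taken from tuplesort.c's actual accounting:

/*
 * Sketch: each tape reserves a fixed slice of workMem up front, so a few
 * extra tapes are cheap individually but costly in bulk.
 */
#include <stdio.h>

int
main(void)
{
    long work_mem_kb = 262144;      /* 256MB, as in the test above */
    long per_tape_kb = 42;          /* assumed fixed reservation per tape */
    int  tape_counts[] = {7, 101, 501, 937};

    for (int i = 0; i < 4; i++)
    {
        long reserved = (long) tape_counts[i] * per_tape_kb;
        printf("%4d tapes: %6ld KB reserved, %5.1f%% of workMem left for tuples\n",
               tape_counts[i], reserved,
               100.0 * (work_mem_kb - reserved) / work_mem_kb);
    }
    return 0;   /* 501 -> ~92% left, 937 -> ~85%: the ~7% gap quoted above */
}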
On Wed, Nov 9, 2016 at 4:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I gather that 0001, which puts a cap on the number of tapes, is not actually related to the subject of this thread; it's an independent change that you think is a good idea. I reviewed the previous discussion on this topic upthread, between you and Heikki, which seems to me to contain more heat than light.

FWIW, I don't remember it that way. Heikki seemed to be uncomfortable with the quasi-arbitrary choice of constant, rather than disagreeing with the general idea of a cap. Or maybe he thought I didn't go far enough -- that is, that polyphase merge should be removed completely. I think that removing polyphase merge would be an orthogonal change to this one, though.

> Now, on the other hand, as far as I can see, the actual amount of evidence that 0001 is a good idea which has been presented in this forum is pretty near zero. You've argued for it on theoretical grounds several times, but theoretical arguments are not a substitute for test results.

See the illustration in TAOCP, vol III, page 273 in the second edition -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think that it's actually a real-world benchmark. I guess I felt that no one ever argued that using as many tapes as possible was sound on any grounds, even theoretical, and so didn't feel obligated to test it until asked to do so. I think that the reason that a cap like this didn't go in around the time that the growth logic went in (2006) was that nobody followed up on it. If you look at the archives, there is plenty of discussion of a cap like this at the time.

> That looks very good. On a test that runs for almost 3 hours, we saved more than 20 minutes. The overall runtime improvement is 23% in a case where we would not expect this patch to do particularly well; after all, without limiting the number of tapes, we are able to complete the sort with a single merge pass, whereas when we reduce the number of tapes, we now require a polyphase merge. Nevertheless, we come out way ahead, because the final merge pass gets way faster, presumably because there are fewer tapes involved. The first test does a 616-way final merge and takes 6184.34 seconds to do it. The second test does a 501-way final merge and takes 4519.31 seconds to do it. This increased final merge speed accounts for practically all of the speedup, and the reason it's faster pretty much has to be that it's merging fewer tapes.
>
> That, in turn, happens for two reasons. First, because limiting the number of tapes slightly increases the memory available for storing the tuples belonging to each run, we end up with fewer runs in the first place. The number of runs drops from 616 to 577, about a 7% reduction. Second, because we have more runs than tapes in the second case, it does a 77-way merge prior to the final merge. Because of that 77-way merge, the time at which the second test starts producing tuples is slightly later. Instead of producing the first tuple at 70:47.71, we have to wait until 75:72.22. That's a small disadvantage in this case, because it's hypothetically possible that a query like this could have a LIMIT and we'd end up worse off overall. However, that's pretty unlikely, for three reasons. Number one, LIMIT isn't likely to be used on queries of this type in the first place. Number two, if it were used, we'd probably end up with a bounded sort plan which would be way faster anyway.
> Number three, if somehow we still sorted the data set, we'd still win in this case if the limit were more than about 20% of the total number of tuples. Possibly needing to wait a little longer for the first tuple is a small price to pay for the much faster run time to produce the whole data set.

Cool.

> So, I'm now feeling pretty bullish about this patch, except for one thing, which is that I think the comments are way off-base. Peter writes: "When allowedMem is significantly lower than what is required for an internal sort, it is unlikely that there are benefits to increasing the number of tapes beyond Knuth's 'sweet spot' of 7." I'm pretty sure that's totally wrong, first of all because commit df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing precisely the thing which this comment says we shouldn't

It's more complicated than that. As I said, I think that Knuth basically had it right with his sweet spot of 7. I think that commit df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part because a one-pass merge avoided certain overheads not inherent to polyphase merge, like all that memory accounting stuff, extra palloc() traffic, etc. The expanded use of per-tape buffering that we have even in multi-pass cases likely makes that much less true for us these days.

The reason I haven't actually gone right back down to 7 with this cap is that it's possible that the added I/O costs outweigh the CPU costs in extreme cases, even though I think that polyphase merge doesn't have all that much to do with I/O costs, even with its 1970s perspective. Knuth doesn't say much about I/O costs -- it's more about using an extremely small amount of memory effectively (minimizing CPU costs with very little available main memory).

Furthermore, not limiting ourselves to 7 tapes and seeing a benefit (benefitting from a few dozen or a few hundred instead) seems more possible with the improved merge heap maintenance logic added recently, where there could be perhaps hundreds of runs merged with very low CPU cost in the event of presorted input (or input that is inversely logically/physically correlated). That would be true because we'd only examine the top of the heap throughout, and so I/O costs may matter much more.

Depending on the exact details, I bet you could see a benefit with only 7 tapes due to CPU cache efficiency in a case like the one you describe. Perhaps when sorting integers, but not when sorting collated text. There are many competing considerations, which I've tried my best to balance here with a merge order of 500.

> Sound OK?

I'm fine with not mentioning Knuth's sweet spot once more. I guess it's not of much practical value that he was on to something with that. I realize, on reflection, that my understanding of what's going on is very nuanced. Thanks

--
Peter Geoghegan
On Wed, Nov 9, 2016 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It's more complicated than that. As I said, I think that Knuth basically had it right with his sweet spot of 7. I think that commit df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part because a one-pass merge avoided certain overheads not inherent to polyphase merge, like all that memory accounting stuff, extra palloc() traffic, etc. The expanded use of per-tape buffering that we have even in multi-pass cases likely makes that much less true for us these days.

Also, logtape.c fragmentation made multiple merge pass cases experience increased random I/O in a way that was only an accident of our implementation. We've fixed that now, but that problem must have added further cost that df700e6b40195d28dc764e0c694ac8cef90d4638 *masked* when it was committed in 2006. (I do think that the problem with the merge heap maintenance fixed recently in 24598337c8d214ba8dcf354130b72c49636bba69 was the biggest problem that the 2006 work masked, though.)

--
Peter Geoghegan
On Wed, Nov 9, 2016 at 7:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Now, on the other hand, as far as I can see, the actual amount of evidence that 0001 is a good idea which has been presented in this forum is pretty near zero. You've argued for it on theoretical grounds several times, but theoretical arguments are not a substitute for test results.
>
> See the illustration in TAOCP, vol III, page 273 in the second edition -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think that it's actually a real-world benchmark.

I don't have that publication, and I'm guessing that's not based on PostgreSQL's implementation. There's no substitute for tests using the code we've actually got.

>> So, I'm now feeling pretty bullish about this patch, except for one thing, which is that I think the comments are way off-base. Peter writes: "When allowedMem is significantly lower than what is required for an internal sort, it is unlikely that there are benefits to increasing the number of tapes beyond Knuth's 'sweet spot' of 7." I'm pretty sure that's totally wrong, first of all because commit df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing precisely the thing which this comment says we shouldn't
>
> It's more complicated than that. As I said, I think that Knuth basically had it right with his sweet spot of 7. I think that commit df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part because a one-pass merge avoided certain overheads not inherent to polyphase merge, like all that memory accounting stuff, extra palloc() traffic, etc. The expanded use of per-tape buffering that we have even in multi-pass cases likely makes that much less true for us these days.
>
> The reason I haven't actually gone right back down to 7 with this cap is that it's possible that the added I/O costs outweigh the CPU costs in extreme cases, even though I think that polyphase merge doesn't have all that much to do with I/O costs, even with its 1970s perspective. Knuth doesn't say much about I/O costs -- it's more about using an extremely small amount of memory effectively (minimizing CPU costs with very little available main memory).
>
> Furthermore, not limiting ourselves to 7 tapes and seeing a benefit (benefitting from a few dozen or a few hundred instead) seems more possible with the improved merge heap maintenance logic added recently, where there could be perhaps hundreds of runs merged with very low CPU cost in the event of presorted input (or input that is inversely logically/physically correlated). That would be true because we'd only examine the top of the heap throughout, and so I/O costs may matter much more.
>
> Depending on the exact details, I bet you could see a benefit with only 7 tapes due to CPU cache efficiency in a case like the one you describe. Perhaps when sorting integers, but not when sorting collated text. There are many competing considerations, which I've tried my best to balance here with a merge order of 500.

I guess that's possible, but the problem with polyphase merge is that the increased I/O becomes a pretty significant cost in a hurry.
Here's the same test with max_sort_tapes = 100:

2016-11-09 23:02:49 UTC [48551] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 23:02:55 UTC [48551] LOG: switching to external sort with 101 tapes: CPU: user: 5.72 s, system: 0.25 s, elapsed: 6.04 s
2016-11-10 00:13:00 UTC [48551] LOG: finished writing run 544 to tape 49: CPU: user: 4003.00 s, system: 156.89 s, elapsed: 4211.33 s
2016-11-10 00:16:52 UTC [48551] LOG: finished 51-way merge step: CPU: user: 4214.84 s, system: 161.94 s, elapsed: 4442.98 s
2016-11-10 00:25:41 UTC [48551] LOG: finished 100-way merge step: CPU: user: 4704.14 s, system: 170.83 s, elapsed: 4972.47 s
2016-11-10 00:36:47 UTC [48551] LOG: finished 99-way merge step: CPU: user: 5333.12 s, system: 179.94 s, elapsed: 5638.52 s
2016-11-10 00:45:32 UTC [48551] LOG: finished 99-way merge step: CPU: user: 5821.13 s, system: 189.00 s, elapsed: 6163.53 s
2016-11-10 01:01:29 UTC [48551] LOG: finished 100-way merge step: CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:01:29 UTC [48551] LOG: performsort done (except 100-way final merge): CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:45:40 UTC [48551] LOG: external sort ended, 6255949 disk blocks used: CPU: user: 9271.07 s, system: 232.26 s, elapsed: 9771.49 s

This is already worse than max_sort_tapes = 501, though the total runtime is still better than with no cap (the time to first tuple is way worse, though). I'm going to try max_sort_tapes = 10 next, but I think the basic pattern is already fairly clear. As you reduce the cap on the number of tapes, (a) the time to build the initial runs doesn't change very much, (b) the time to perform the final merge decreases significantly, and (c) the time to perform the non-final merges increases even faster.

In this particular test configuration on this particular hardware, rewriting 77 tapes in the 501-tape configuration wasn't too bad, but now that we're down to 100 tapes, we have to rewrite 449 tapes out of a total of 544, and that's actually a loss: rewriting the bulk of your data an extra time to save on cache misses doesn't pay. It would probably be even less good if there were other concurrent activity on the system. It's possible that if your polyphase merge is actually being done all in memory, cache efficiency might remain the dominant consideration, but I think we should assume that a polyphase merge is doing actual I/O, because it's sort of pointless to use that algorithm in the first place if there's no real I/O involved.

At the moment, at least, it looks to me as though we don't need to be afraid of a *little* bit of polyphase merging, but a *lot* of polyphase merging is actually pretty bad. In other words, by imposing a limit on the number of tapes, we're going to improve sorts that are smaller than work_mem * num_tapes * ~1.5 -- because cache efficiency will be better -- but above that things will probably get worse because of the increased I/O cost. From that point of view, a 500-tape limit is the same as saying that we don't think it's entirely reasonable to try to perform a sort that exceeds work_mem by a factor of more than ~750, whereas a 7-tape limit is the same as saying that we don't think it's entirely reasonable to perform a sort that exceeds work_mem by a factor of more than ~10. That latter proposition seems entirely untenable. Our default work_mem setting is 4MB, and people will certainly expect to be able to get away with, say, an 80MB sort without changing settings.
On the other hand, if they're sorting more than 3GB with work_mem = 4MB, I think we'll be justified in making a gentle suggestion that they reconsider that setting. Among other arguments, it's going to be pretty slow in that case no matter what we do here.

Maybe another way of putting this is that, while there's clearly a benefit to having some kind of a cap, it's appropriate to pick a large value, such as 500. Having no cap at all risks creating many extra tapes that just waste memory, and also risks an unduly cache-inefficient final merge. Reining that in makes sense. However, we can't rein it in too far or we'll create slow polyphase merges in cases that are reasonably likely to occur in real life.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
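The ~750 and ~10 figures above follow from simple arithmetic. Here it is as a sketch; the 1.5x initial-run-size factor is the rough figure used above, not a measured constant:

/*
 * Sketch: a single merge pass can cover roughly (tape cap) runs, and each
 * initial run is roughly 1.5x work_mem, so the cap bounds how large a sort
 * can be completed in one pass.
 */
#include <stdio.h>

int
main(void)
{
    double run_size_factor = 1.5;   /* each initial run ~1.5x work_mem */
    int    caps[] = {7, 501};

    for (int i = 0; i < 2; i++)
        printf("cap of %3d tapes => one merge pass covers up to ~%.1fx work_mem\n",
               caps[i], caps[i] * run_size_factor);
    /* with work_mem = 4MB, a 7-tape cap tops out near 40MB; 501 near 3GB */
    return 0;
}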
On Wed, Nov 9, 2016 at 6:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I guess that's possible, but the problem with polyphase merge is that the increased I/O becomes a pretty significant cost in a hurry.

Not if you have a huge RAID array. :-)

Obviously I'm not seriously suggesting that we revise the cap from 500 to 7. We're only concerned about the constant factors here. There is clearly a need to make some simplifying assumptions. I think that you understand this very well, though.

> Maybe another way of putting this is that, while there's clearly a benefit to having some kind of a cap, it's appropriate to pick a large value, such as 500. Having no cap at all risks creating many extra tapes that just waste memory, and also risks an unduly cache-inefficient final merge. Reining that in makes sense. However, we can't rein it in too far or we'll create slow polyphase merges in cases that are reasonably likely to occur in real life.

I completely agree with your analysis.

--
Peter Geoghegan
On Wed, Nov 9, 2016 at 10:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Maybe another way of putting this is that, while there's clearly a benefit to having some kind of a cap, it's appropriate to pick a large value, such as 500. Having no cap at all risks creating many extra tapes that just waste memory, and also risks an unduly cache-inefficient final merge. Reining that in makes sense. However, we can't rein it in too far or we'll create slow polyphase merges in cases that are reasonably likely to occur in real life.
>
> I completely agree with your analysis.

Cool. BTW, my run with 10 tapes completed in 10696528.377 ms (02:58:16.528) -- i.e. almost 3 minutes slower than with no tape limit. Building runs took 4260.16 s, and the final merge pass began at 8239.12 s. That's certainly better than I expected, and it seems to show that even if the number of tapes is grossly inadequate for the number of runs, you can still make up most of the time that you lose to I/O with improved cache efficiency -- at least under favorable circumstances. Of course, on many systems I/O bandwidth will be a scarce resource, so that argument can be overdone -- and even if not, the 10-tape sort takes FAR longer to deliver the first tuple.

I also tried this out with work_mem = 512MB. Doubling work_mem reduces the number of runs enough that we don't get a polyphase merge in any case. With no limit on tapes:

2016-11-10 11:24:45 UTC [54042] LOG: switching to external sort with 1873 tapes: CPU: user: 11.34 s, system: 0.48 s, elapsed: 12.13 s
2016-11-10 12:36:22 UTC [54042] LOG: finished writing run 308 to tape 307: CPU: user: 4096.63 s, system: 156.88 s, elapsed: 4309.66 s
2016-11-10 12:36:22 UTC [54042] LOG: using 516563 KB of memory for read buffers among 308 input tapes
2016-11-10 12:36:30 UTC [54042] LOG: performsort done (except 308-way final merge): CPU: user: 4097.75 s, system: 157.24 s, elapsed: 4317.76 s
2016-11-10 13:54:07 UTC [54042] LOG: external sort ended, 6255577 disk blocks used: CPU: user: 8638.72 s, system: 177.42 s, elapsed: 8974.44 s

With max_sort_tapes = 501:

2016-11-10 14:23:50 UTC [54042] LOG: switching to external sort with 502 tapes: CPU: user: 10.99 s, system: 0.54 s, elapsed: 11.57 s
2016-11-10 15:36:47 UTC [54042] LOG: finished writing run 278 to tape 277: CPU: user: 4190.31 s, system: 155.33 s, elapsed: 4388.86 s
2016-11-10 15:36:47 UTC [54042] LOG: using 517313 KB of memory for read buffers among 278 input tapes
2016-11-10 15:36:54 UTC [54042] LOG: performsort done (except 278-way final merge): CPU: user: 4191.36 s, system: 155.68 s, elapsed: 4395.66 s
2016-11-10 16:53:39 UTC [54042] LOG: external sort ended, 6255699 disk blocks used: CPU: user: 8673.07 s, system: 175.93 s, elapsed: 9000.80 s

0.3% slower with the tape limit, but that might be noise. Even if not, it seems pretty silly to create 1873 tapes when we only need ~300.
At work_mem = 2GB:

2016-11-10 18:08:00 UTC [54042] LOG: switching to external sort with 7490 tapes: CPU: user: 44.28 s, system: 1.99 s, elapsed: 46.33 s
2016-11-10 19:23:06 UTC [54042] LOG: finished writing run 77 to tape 76: CPU: user: 4342.10 s, system: 156.21 s, elapsed: 4551.95 s
2016-11-10 19:23:06 UTC [54042] LOG: using 2095202 KB of memory for read buffers among 77 input tapes
2016-11-10 19:23:12 UTC [54042] LOG: performsort done (except 77-way final merge): CPU: user: 4343.36 s, system: 157.07 s, elapsed: 4558.79 s
2016-11-10 20:24:24 UTC [54042] LOG: external sort ended, 6255946 disk blocks used: CPU: user: 7894.71 s, system: 176.36 s, elapsed: 8230.13 s

At work_mem = 2GB, max_sort_tapes = 501:

2016-11-10 21:28:23 UTC [54042] LOG: switching to external sort with 502 tapes: CPU: user: 44.09 s, system: 1.94 s, elapsed: 46.07 s
2016-11-10 22:42:28 UTC [54042] LOG: finished writing run 68 to tape 67: CPU: user: 4278.49 s, system: 154.39 s, elapsed: 4490.25 s
2016-11-10 22:42:28 UTC [54042] LOG: using 2095427 KB of memory for read buffers among 68 input tapes
2016-11-10 22:42:34 UTC [54042] LOG: performsort done (except 68-way final merge): CPU: user: 4279.60 s, system: 155.21 s, elapsed: 4496.83 s
2016-11-10 23:42:10 UTC [54042] LOG: external sort ended, 6255983 disk blocks used: CPU: user: 7733.98 s, system: 173.85 s, elapsed: 8072.55 s

Roughly 2% faster. Maybe still noise, but less likely. 7490 tapes certainly seems over the top.

At work_mem = 8GB:

2016-11-14 19:17:28 UTC [54042] LOG: switching to external sort with 29960 tapes: CPU: user: 183.80 s, system: 7.71 s, elapsed: 191.61 s
2016-11-14 20:32:02 UTC [54042] LOG: finished writing run 20 to tape 19: CPU: user: 4431.44 s, system: 176.82 s, elapsed: 4665.16 s
2016-11-14 20:32:02 UTC [54042] LOG: using 8388083 KB of memory for read buffers among 20 input tapes
2016-11-14 20:32:26 UTC [54042] LOG: performsort done (except 20-way final merge): CPU: user: 4432.99 s, system: 181.29 s, elapsed: 4689.52 s
2016-11-14 21:30:56 UTC [54042] LOG: external sort ended, 6256003 disk blocks used: CPU: user: 7835.83 s, system: 199.01 s, elapsed: 8199.29 s

At work_mem = 8GB, max_sort_tapes = 501:

2016-11-14 21:52:43 UTC [54042] LOG: switching to external sort with 502 tapes: CPU: user: 181.08 s, system: 7.66 s, elapsed: 189.05 s
2016-11-14 23:06:06 UTC [54042] LOG: finished writing run 17 to tape 16: CPU: user: 4381.56 s, system: 161.82 s, elapsed: 4591.63 s
2016-11-14 23:06:06 UTC [54042] LOG: using 8388158 KB of memory for read buffers among 17 input tapes
2016-11-14 23:06:36 UTC [54042] LOG: performsort done (except 17-way final merge): CPU: user: 4383.45 s, system: 165.32 s, elapsed: 4622.04 s
2016-11-14 23:54:00 UTC [54042] LOG: external sort ended, 6256002 disk blocks used: CPU: user: 7124.49 s, system: 182.16 s, elapsed: 7466.18 s

Roughly 9% faster. The time to build runs seems to degrade very slowly as we increase work_mem, but the final merge is speeding up somewhat more quickly. Intuitively that makes sense to me: if merging were faster than quicksorting, we could just merge-sort all the time instead of using quicksort for internal sorts. Also, we've got 29960 tapes now, better than three orders of magnitude more than what we actually need. At this work_mem setting, 501 tapes is enough to efficiently sort at least 4TB of data and quite possibly a good bit more.

So, committed 0001, with comment changes along the lines I proposed before.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> What do we need to teach pg_restore about parallel CREATE INDEX, if anything at all? Could this be as simple as a blanket disabling of parallelism for CREATE INDEX from pg_restore? Or, does it need to be more sophisticated than that? I suppose that tools like reindexdb and pgbench must be considered in a similar way.

I still haven't resolved this question, which seems like the most important outstanding question, but I attach V6. Changes:

* tuplesort.c was adapted to use the recently committed condition variables stuff. This made things cleaner. No more ad-hoc WaitLatch() looping.

* Adapted docs to mention the newly committed max_parallel_workers GUC in the context of discussing the proposed max_parallel_workers_maintenance GUC.

* Fixed a trivial assertion failure bug that could be tripped when a conventional sort uses very little memory.

--
Peter Geoghegan
Peter Geoghegan wrote:
> On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > What do we need to teach pg_restore about parallel CREATE INDEX, if anything at all? Could this be as simple as a blanket disabling of parallelism for CREATE INDEX from pg_restore? Or, does it need to be more sophisticated than that? I suppose that tools like reindexdb and pgbench must be considered in a similar way.
>
> I still haven't resolved this question, which seems like the most important outstanding question,

I don't think a patch must necessarily consider all possible uses that the new feature may have. If we introduce parallel index creation, that's great; if pg_restore doesn't start using it right away, that's okay. You, or somebody else, can still patch it later. The patch is still a step forward.

--
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I don't think a patch must necessarily consider all possible uses that the new feature may have. If we introduce parallel index creation, that's great; if pg_restore doesn't start using it right away, that's okay. You, or somebody else, can still patch it later. The patch is still a step forward.

While I agree, right now pg_restore will tend to use or not use parallelism for CREATE INDEX more or less by accident, based on whether or not pg_class.reltuples has already been set by something else (e.g., an earlier CREATE INDEX against the same table in the restoration). That seems unacceptable. I haven't just suppressed the use of parallel CREATE INDEX within pg_restore because that would be taking a position on something I have a hard time defending any particular position on. And so, I am slightly concerned about the entire ecosystem of tools that could implicitly use parallel CREATE INDEX, with undesirable consequences. Especially pg_restore.

It's not so much a hard question as it is an awkward one. I want to handle any possible objection about there being future compatibility issues with going one way or the other ("This paints us into a corner with..."). And there is no existing, simple way for pg_restore and other tools to disable the use of parallelism due to the cost model automatically kicking in, while still allowing the proposed new index storage parameter ("parallel_workers") to force the use of parallelism, which seems like something that should happen. (I might have to add a new GUC like "enable_maintenance_parallelism", since "max_parallel_workers_maintenance = 0" disables parallelism no matter how it might be invoked.)

In general, I have a positive outlook on this patch, since it appears to compete well with similar implementations in other systems scalability-wise. It does what it's supposed to do.

--
Peter Geoghegan
On Sat, 2016-12-03 at 18:37 -0800, Peter Geoghegan wrote:
> On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I don't think a patch must necessarily consider all possible uses that the new feature may have. If we introduce parallel index creation, that's great; if pg_restore doesn't start using it right away, that's okay. You, or somebody else, can still patch it later. The patch is still a step forward.
>
> While I agree, right now pg_restore will tend to use or not use parallelism for CREATE INDEX more or less by accident, based on whether or not pg_class.reltuples has already been set by something else (e.g., an earlier CREATE INDEX against the same table in the restoration). That seems unacceptable. I haven't just suppressed the use of parallel CREATE INDEX within pg_restore because that would be taking a position on something I have a hard time defending any particular position on. And so, I am slightly concerned about the entire ecosystem of tools that could implicitly use parallel CREATE INDEX, with undesirable consequences. Especially pg_restore.
>
> It's not so much a hard question as it is an awkward one. I want to handle any possible objection about there being future compatibility issues with going one way or the other ("This paints us into a corner with..."). And there is no existing, simple way for pg_restore and other tools to disable the use of parallelism due to the cost model automatically kicking in, while still allowing the proposed new index storage parameter ("parallel_workers") to force the use of parallelism, which seems like something that should happen. (I might have to add a new GUC like "enable_maintenance_parallelism", since "max_parallel_workers_maintenance = 0" disables parallelism no matter how it might be invoked.)

I do share your concerns about unpredictable behavior - that's particularly worrying for pg_restore, which may be used for time-sensitive use cases (DR, migrations between versions), so unpredictable changes in behavior / duration are unwelcome.

But isn't this more a deficiency in pg_restore than in CREATE INDEX? The issue seems to be that the reltuples value may or may not get updated, so maybe forcing ANALYZE (even very low statistics_target values would do the trick, I think) would be a more appropriate solution? Or maybe it's time to add at least some rudimentary statistics into the dumps (the reltuples field seems like a good candidate). Trying to fix this by adding more GUCs seems a bit strange to me.

> In general, I have a positive outlook on this patch, since it appears to compete well with similar implementations in other systems scalability-wise. It does what it's supposed to do.

+1 to that

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I do share your concerns about unpredictable behavior - that's particularly worrying for pg_restore, which may be used for time-sensitive use cases (DR, migrations between versions), so unpredictable changes in behavior / duration are unwelcome.

Right.

> But isn't this more a deficiency in pg_restore than in CREATE INDEX? The issue seems to be that the reltuples value may or may not get updated, so maybe forcing ANALYZE (even very low statistics_target values would do the trick, I think) would be a more appropriate solution? Or maybe it's time to add at least some rudimentary statistics into the dumps (the reltuples field seems like a good candidate).

I think that there are a number of reasonable ways of looking at it. It might also be worthwhile to have a minimal ANALYZE performed by CREATE INDEX directly, iff there are no preexisting statistics (there is definitely going to be something pg_restore-like that we cannot fix -- some ETL tool, for example). Perhaps, as an additional condition for proceeding with such an ANALYZE, it should also only happen when there is any chance at all of parallelism being used (but then you get into having to establish the relation size reliably in the absence of any pg_class.relpages, which isn't very appealing when there are many tiny indexes).

In summary, I would really like it if a consensus emerged on how parallel CREATE INDEX should handle the ecosystem of tools like pg_restore, reindexdb, and so on. Personally, I'm neutral on which general approach should be taken. Proposals from other hackers about what to do here are particularly welcome.

--
Peter Geoghegan
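To make the conditional pre-ANALYZE idea from the message above concrete, here is a hypothetical fragment; quick_sample_analyze() and min_parallel_index_blocks are invented stand-ins, and nothing like this exists in the patch:

/*
 * Hypothetical sketch only: run a cheap, low-target ANALYZE before an
 * index build when no statistics exist and parallelism is plausible.
 * Only RelationGetNumberOfBlocks() and reltuples are real names here.
 */
static void
maybe_analyze_before_index_build(Relation heapRel)
{
    /* statistics already exist; nothing to do */
    if (heapRel->rd_rel->reltuples != 0)
        return;

    /* skip tiny relations, so restoring many small tables stays cheap */
    if (RelationGetNumberOfBlocks(heapRel) < min_parallel_index_blocks)
        return;

    /* hypothetical: ANALYZE with a very low statistics target */
    quick_sample_analyze(heapRel);
}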
On Mon, Dec 5, 2016 at 7:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > I do share your concerns about unpredictable behavior - that's
> > particularly worrying for pg_restore, which may be used for time-
> > sensitive use cases (DR, migrations between versions), so unpredictable
> > changes in behavior / duration are unwelcome.
> Right.
> > But isn't this more a deficiency in pg_restore than in CREATE INDEX?
> > The issue seems to be that the reltuples value may or may not get
> > updated, so maybe forcing ANALYZE (even very low statistics_target
> > values would do the trick, I think) would be a more appropriate solution?
> > Or maybe it's time to add at least some rudimentary statistics into the
> > dumps (the reltuples field seems like a good candidate).
> I think that there are a number of reasonable ways of looking at it. It
> might also be worthwhile to have a minimal ANALYZE performed by CREATE
> INDEX directly, iff there are no preexisting statistics (there is
> definitely going to be something pg_restore-like that we cannot fix --
> some ETL tool, for example). Perhaps, as an additional condition for
> proceeding with such an ANALYZE, it should also only happen when there
> is any chance at all of parallelism being used (but then you get into
> having to establish the relation size reliably in the absence of any
> pg_class.relpages, which isn't very appealing when there are many tiny
> indexes).
> In summary, I would really like it if a consensus emerged on how
> parallel CREATE INDEX should handle the ecosystem of tools like
> pg_restore, reindexdb, and so on. Personally, I'm neutral on which
> general approach should be taken. Proposals from other hackers about
> what to do here are particularly welcome.
Moved to next CF with "needs review" status.
Regards,
Hari Babu
Fujitsu Australia
On Wed, Sep 21, 2016 at 12:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I find this unification business really complicated. I think it'd be simpler to keep the BufFiles and LogicalTapeSets separate, and instead teach tuplesort.c how to merge tapes that live on different LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet doesn't have the concept of a tape number anymore, it can contain any number of tapes, and you can create more on the fly. With that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that came from different tape sets, backed by different BufFiles. I think that'd avoid much of the unification code.

I just looked at the buffile.c/buffile.h changes in the latest version of the patch and I agree with this criticism, though maybe not with the proposed solution. I actually don't understand what "unification" is supposed to mean. The patch really doesn't explain that anywhere that I can see. It says stuff like:

+ * Parallel operations can use an interface to unify multiple worker-owned
+ * BufFiles and a leader-owned BufFile within a leader process.  This relies
+ * on various fd.c conventions about the naming of temporary files.

That comment tells you that unification is a thing you can do -- via an unspecified interface for unspecified reasons using unspecified conventions -- but it doesn't tell you what the semantics of it are supposed to be. For example, if we "unify" several BufFiles, do they then have a shared seek pointer? Do the existing contents effectively get concatenated in an unpredictable order, or are they all expected to be empty at the time unification happens? Or something else? It's fine to make up new words -- indeed, in some sense that is the essence of writing about any complex problem -- but you have to define them. As far as I can tell, the idea is that we're somehow magically concatenating the BufFiles into one big super-BufFile, but I'm fuzzy on exactly what's supposed to be going on there.

It's hard to understand how something like this doesn't leak resources. Maybe that's been thought about here, but it isn't very clear to me how it's supposed to work. In Heikki's proposal, if process A is trying to read a file owned by process B, and process B dies and removes the file before process A gets around to reading it, we have got trouble, especially on Windows, which apparently has low tolerance for such things. Peter's proposal avoids that - I *think* - by making the leader responsible for all resource cleanup, but that's inferior to the design we've used for other sorts of shared resource cleanup (DSM, DSA, shm_mq, lock groups) where the last process to detach always takes responsibility. That avoids assuming that we're always dealing with a leader-follower situation, it doesn't categorically require the leader to be the one who creates the shared resource, and it doesn't require the leader to be the last process to die.

Imagine a data structure that is stored in dynamic shared memory and contains space for a filename, a reference count, and a mutex. Let's call this thing a SharedTemporaryFile or something like that.
It offers these APIs:

extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

After setting aside sizeof(SharedTemporaryFile) bytes in your shared DSM segment, you call SharedTemporaryFileInitialize() to initialize them. Then, every process that cares about the file does SharedTemporaryFileAttach(), which bumps the reference count and sets an on_dsm_detach hook to decrement the reference count and unlink the file if the reference count thereby reaches 0. One of those processes does SharedTemporaryFileAssign(), which fills in the pathname and clears FD_TEMPORARY. Then, any process that has attached can call SharedTemporaryFileGetFile() to get a File which can then be accessed normally.

So, the pattern for parallel sort would be:

- Leader sets aside space and calls SharedTemporaryFileInitialize() and SharedTemporaryFileAttach().
- The cooperating worker calls SharedTemporaryFileAttach() and then SharedTemporaryFileAssign().
- The leader then calls SharedTemporaryFileGetFile().

Since the leader can attach to the file before the path name is filled in, there's no window where the file is at risk of being leaked. Before SharedTemporaryFileAssign(), the worker is solely responsible for removing the file; after that call, whichever of the leader and the worker exits last will remove the file.

> That leaves one problem, though: reusing space in the final merge phase. If the tapes being merged belong to different LogicalTapeSets, and you create one new tape to hold the result, the new tape cannot easily reuse the space of the input tapes because they are on different tape sets.

If the worker is always completely finished with the tape before the leader touches it, couldn't the leader's LogicalTapeSet just "adopt" the tape and overwrite it like any other?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
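The reference-counted cleanup described above can be sketched in ordinary C, with a pthread mutex and unlink() standing in for the PostgreSQL primitives (DSM, on_dsm_detach, LWLocks). This is only an illustration of the last-one-out-unlinks logic, not a proposed implementation:

/*
 * Sketch: a shared structure holding a refcount and a pathname; the last
 * detaching process removes the file, if it was ever assigned a name.
 */
#include <pthread.h>
#include <string.h>
#include <unistd.h>

typedef struct SharedTemporaryFile
{
    pthread_mutex_t mutex;
    int             refcnt;
    char            pathname[1024];  /* empty until "assigned" */
} SharedTemporaryFile;

void
SharedTemporaryFileInitialize(SharedTemporaryFile *stf)
{
    pthread_mutex_init(&stf->mutex, NULL);
    stf->refcnt = 0;
    stf->pathname[0] = '\0';
}

/* each interested process attaches, bumping the count */
void
SharedTemporaryFileAttach(SharedTemporaryFile *stf)
{
    pthread_mutex_lock(&stf->mutex);
    stf->refcnt++;
    pthread_mutex_unlock(&stf->mutex);
    /* the real thing would also register a detach hook calling ...Detach() */
}

/* the creating process fills in the name once the file exists */
void
SharedTemporaryFileAssign(SharedTemporaryFile *stf, const char *pathname)
{
    pthread_mutex_lock(&stf->mutex);
    strncpy(stf->pathname, pathname, sizeof(stf->pathname) - 1);
    pthread_mutex_unlock(&stf->mutex);
}

/* detach hook: the last process out removes the file */
void
SharedTemporaryFileDetach(SharedTemporaryFile *stf)
{
    int remove_it;

    pthread_mutex_lock(&stf->mutex);
    remove_it = (--stf->refcnt == 0 && stf->pathname[0] != '\0');
    pthread_mutex_unlock(&stf->mutex);
    if (remove_it)
        unlink(stf->pathname);
}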
On Tue, Dec 20, 2016 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> What I have in mind is something like the attached patch. It refactors LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet doesn't have the concept of a tape number anymore, it can contain any number of tapes, and you can create more on the fly. With that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that came from different tape sets, backed by different BufFiles. I think that'd avoid much of the unification code.
>
> I just looked at the buffile.c/buffile.h changes in the latest version of the patch and I agree with this criticism, though maybe not with the proposed solution. I actually don't understand what "unification" is supposed to mean. The patch really doesn't explain that anywhere that I can see. It says stuff like:
>
> + * Parallel operations can use an interface to unify multiple worker-owned
> + * BufFiles and a leader-owned BufFile within a leader process.  This relies
> + * on various fd.c conventions about the naming of temporary files.

Without meaning to sound glib, unification is the process by which parallel CREATE INDEX has the leader read temp files from workers sufficient to complete its final on-the-fly merge. So, it's a term that's a bit like "speculative insertion" was up until UPSERT was committed: a concept that is somewhat in flux, and that describes a new low-level mechanism built to support a higher-level operation, which must accord with a higher-level set of requirements (so, for speculative insertion, that would be avoiding "unprincipled deadlocks" and so on). That being the case, maybe "unification" isn't useful as a precise piece of terminology at this point, but that will change.

While I'm fairly confident that I basically have the right idea with this patch, I think that you are better at judging the ins and outs of resource management than I am, not least because of the experience of working on parallel query itself. Also, I'm signed up to review parallel hash join in large part because I think there might be some convergence concerning the sharing of BufFiles among parallel workers. I don't think I'm qualified to judge what a general abstraction like this should look like, but I'm trying to get there.

> That comment tells you that unification is a thing you can do -- via an unspecified interface for unspecified reasons using unspecified conventions -- but it doesn't tell you what the semantics of it are supposed to be. For example, if we "unify" several BufFiles, do they then have a shared seek pointer?

No.

> Do the existing contents effectively get concatenated in an unpredictable order, or are they all expected to be empty at the time unification happens? Or something else?

The order is the same order in which ordinal identifiers are assigned to workers within tuplesort.c, which is undefined, with the notable exception of the leader's own identifier (-1) and area of the unified BufFile space (this is only relevant in randomAccess cases, where the leader may write stuff out to its own reserved part of the BufFile space). It only matters that the bit of metadata in shared memory is in that same order, which it clearly will be. So, it's unpredictable, but in the same way that ordinal identifiers are assigned in a not-well-defined order; it doesn't, or at least shouldn't, matter.
We can imagine a case where it does matter, and we probably should, but that case isn't parallel CREATE INDEX.

> It's fine to make up new words -- indeed, in some sense that is the essence of writing about any complex problem -- but you have to define them.

I invite you to help me define this new word.

> It's hard to understand how something like this doesn't leak resources. Maybe that's been thought about here, but it isn't very clear to me how it's supposed to work.

I agree that it would be useful to centrally document what all this unification stuff is about. Suggestions on where that should live are welcome.

> In Heikki's proposal, if process A is trying to read a file owned by process B, and process B dies and removes the file before process A gets around to reading it, we have got trouble, especially on Windows, which apparently has low tolerance for such things. Peter's proposal avoids that - I *think* - by making the leader responsible for all resource cleanup, but that's inferior to the design we've used for other sorts of shared resource cleanup (DSM, DSA, shm_mq, lock groups) where the last process to detach always takes responsibility.

Maybe it's inferior to that, but I think what Heikki proposes is more or less complementary to what I've proposed, and has nothing to do with resource management and plenty to do with making the logtape.c interface look nice, AFAICT. It's also about refactoring/simplifying logtape.c itself, while we're at it. I believe that Heikki has yet to comment either way on my approach to resource management, one aspect of the patch that I was particularly keen on your looking into.

The theory of operation here is that workers own their own BufFiles, and are responsible for deleting them when they die. The assumption, rightly or wrongly, is that it's sufficient that workers flush everything out (write out temp files), and yield control to the leader, which will open their temp files for the duration of the leader's final on-the-fly merge. The resource manager in the leader knows it isn't supposed to ever delete worker-owned files (just close() the FDs), and the leader errors if it cannot find temp files that match what it expects. If there is an error in the leader, it shuts down workers, and they clean up, more than likely. If there is an error in the worker, or if the files cannot be deleted (e.g., if there is a classic hard crash scenario), we should also be okay, because nobody will trip up on some old temp file from some worker, since fd.c has some gumption about what workers need to do (and what the leader needs to avoid) in the event of a hard crash. I don't see a risk of file descriptor leaks, which may or may not have been part of your concern (please clarify).

> That avoids assuming that we're always dealing with a leader-follower situation, it doesn't categorically require the leader to be the one who creates the shared resource, and it doesn't require the leader to be the last process to die.

I have an open mind about that, especially given the fact that I hope to generalize the unification stuff further, but I am not aware of any reason why that is strictly necessary.

> Imagine a data structure that is stored in dynamic shared memory and contains space for a filename, a reference count, and a mutex. Let's call this thing a SharedTemporaryFile or something like that.
> It offers these APIs:
>
> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
> extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
> extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

I'm a little bit tired right now, and I have yet to look at Thomas' parallel hash join patch in any detail. I'm interested in what you have to say here, but I think that I need to learn more about its requirements in order to have an informed opinion.

>> That leaves one problem, though: reusing space in the final merge phase. If the tapes being merged belong to different LogicalTapeSets, and you create one new tape to hold the result, the new tape cannot easily reuse the space of the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the leader touches it, couldn't the leader's LogicalTapeSet just "adopt" the tape and overwrite it like any other?

I'll remind you that parallel CREATE INDEX doesn't actually ever need to be randomAccess, and so we are not actually going to ever need to do this as things stand. I wrote the code that way in order to not break the existing interface, which seemed like a blocker to posting the patch. I am open to the idea of such an "adoption" occurring, even though it actually wouldn't help any case that exists in the patch as proposed. I didn't go that far in part because it seemed premature, given that nobody had looked at my work to date at the time, and given the fact that there'd be no initial user-visible benefit, and given how the exact meaning of "unification" was (and is) somewhat in flux.

I see no good reason not to do that, although that might change if I actually seriously undertook to teach the leader about this kind of "adoption". I suspect that the interface specification would make for confusing reading, which isn't terribly appealing, but I'm sure I could manage to make it work given time.

--
Peter Geoghegan
On Tue, Dec 20, 2016 at 8:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Without meaning to sound glib, unification is the process by which
> parallel CREATE INDEX has the leader read temp files from workers
> sufficient to complete its final on-the-fly merge.

That's not glib, but you can't in the end define BufFile unification in terms of what parallel CREATE INDEX needs. Whatever changes we make to lower-level abstractions in the service of some higher-level goal need to be explainable on their own terms.

>> It's fine to make up new words -- indeed, in some sense that is the essence
>> of writing any complex problem -- but you have to define them.
>
> I invite you to help me define this new word.

If at some point I'm able to understand what it means, I'll try to do that. I think you're loosely using "unification" to mean combining stuff from different backends in some way that depends on the particular context, so that "BufFile unification" can be different from "LogicalTape unification". But that's just punting the question of what each of those things actually are.

> Maybe it's inferior to that, but I think what Heikki proposes is more
> or less complementary to what I've proposed, and has nothing to do
> with resource management and plenty to do with making the logtape.c
> interface look nice, AFAICT. It's also about refactoring/simplifying
> logtape.c itself, while we're at it. I believe that Heikki has yet to
> comment either way on my approach to resource management, one aspect
> of the patch that I was particularly keen on your looking into.

My reading of Heikki's point was that there's not much point in touching the BufFile level of things if we can do all of the necessary stuff at the LogicalTape level, and I agree with him about that. If a shared BufFile had a shared read-write pointer, that would be a good justification for having it. But it seems like unification at the BufFile level is just concatenation, and that can be done just as well at the LogicalTape level, so why tinker with BufFile? As I've said, I think there's some low-level hacking needed here to make sure files get removed at the correct time in all cases, but apart from that I see no good reason to push the concatenation operation all the way down into BufFile.

> The theory of operation here is that workers own their own BufFiles,
> and are responsible for deleting them when they die. The assumption,
> rightly or wrongly, is that it's sufficient that workers flush
> everything out (write out temp files), and yield control to the
> leader, which will open their temp files for the duration of the
> leader's final on-the-fly merge. The resource manager in the leader
> knows it isn't supposed to ever delete worker-owned files (just
> close() the FDs), and the leader errors if it cannot find temp files
> that match what it expects. If there is an error in the leader, it
> shuts down workers, and they clean up, more than likely. If there is
> an error in the worker, or if the files cannot be deleted (e.g., if
> there is a classic hard crash scenario), we should also be okay,
> because nobody will trip up on some old temp file from some worker,
> since fd.c has some gumption about what workers need to do (and what
> the leader needs to avoid) in the event of a hard crash. I don't see a
> risk of file descriptor leaks, which may or may not have been part of
> your concern (please clarify).
I don't think there's any new issue with file descriptor leaks here, but I think there is a risk of calling unlink() too early or too late with your design. My proposal was an effort to nail that down real tight. >> If the worker is always completely finished with the tape before the >> leader touches it, couldn't the leader's LogicalTapeSet just "adopt" >> the tape and overwrite it like any other? > > I'll remind you that parallel CREATE INDEX doesn't actually ever need > to be randomAccess, and so we are not actually going to ever need to > do this as things stand. I wrote the code that way in order to not > break the existing interface, which seemed like a blocker to posting > the patch. I am open to the idea of such an "adoption" occurring, even > though it actually wouldn't help any case that exists in the patch as > proposed. I didn't go that far in part because it seemed premature, > given that nobody had looked at my work to date at the time, and given > the fact that there'd be no initial user-visible benefit, and given > how the exact meaning of "unification" was (and is) somewhat in flux. > > I see no good reason to not do that, although that might change if I > actually seriously undertook to teach the leader about this kind of > "adoption". I suspect that the interface specification would make for > confusing reading, which isn't terribly appealing, but I'm sure I > could manage to make it work given time. I think the interface is pretty clear: the worker's logical tapes get incorporated into the leader's LogicalTapeSet as if they'd been there all along. After all, by the time this is happening, IIUC (please confirm), the worker is done with those tapes and will never read or modify them again. If that's right, the worker just needs a way to identify those tapes to the leader, which can then add them to its LogicalTapeSet. That's it. It needs a way to identify them, but I think that shouldn't be hard; in fact, I think your patch has something like that already. And it needs to make sure that the files get removed at the right time, but I already sketched a solution to that problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/21/2016 12:53 AM, Robert Haas wrote:
>> That leaves one problem, though: reusing space in the final merge phase. If
>> the tapes being merged belong to different LogicalTapeSets, and create one
>> new tape to hold the result, the new tape cannot easily reuse the space of
>> the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the
> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
> the tape and overwrite it like any other?

Currently, the logical tape code assumes that all tapes in a single LogicalTapeSet are allocated from the same BufFile. The logical tape's on-disk format contains block numbers, to point to the next/prev block of the tape [1], and they're assumed to refer to the same file. That allows reusing space efficiently during the merge. After you have read the first block from tapes A, B and C, you can immediately reuse those three blocks for output tape D.

Now, if you read multiple tapes from different LogicalTapeSets, hence backed by different BufFiles, you cannot reuse the space from those different tapes for a single output tape, because the on-disk format doesn't allow referring to blocks in other files. You could reuse the space of *one* of the input tapes, by placing the output tape in the same LogicalTapeSet, but not all of them.

We could enhance that, by using "filename + block number" instead of just block number, in the pointers in the logical tapes. Then you could spread one logical tape across multiple files. Probably not worth it in practice, though.

[1] As the code stands, there are no next/prev pointers, but a tree of "indirect" blocks. But I'm planning to change that to simpler next/prev pointers, in https://www.postgresql.org/message-id/flat/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi

- Heikki
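For illustration, the widened pointer Heikki mentions might look like this; the struct and field names are invented for the example, not taken from any patch:

/*
 * Today a tape's next/prev pointers are plain block numbers, implicitly
 * relative to the tape set's single BufFile.  Widening them like this
 * would let one logical tape span files belonging to several tape sets,
 * at the cost of bigger on-disk pointers.
 */
typedef struct TapeBlockPointer
{
	int		fileno;		/* proxy for the backing file's name */
	long	blkno;		/* block number within that file */
} TapeBlockPointer;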
On Wed, Dec 21, 2016 at 7:04 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> Currently, the logical tape code assumes that all tapes in a single
> LogicalTapeSet are allocated from the same BufFile. The logical tape's
> on-disk format contains block numbers, to point to the next/prev block of
> the tape [1], and they're assumed to refer to the same file. That allows
> reusing space efficiently during the merge. After you have read the first
> block from tapes A, B and C, you can immediately reuse those three blocks
> for output tape D.

I see. Hmm.

> Now, if you read multiple tapes from different LogicalTapeSets, hence backed
> by different BufFiles, you cannot reuse the space from those different tapes
> for a single output tape, because the on-disk format doesn't allow referring
> to blocks in other files. You could reuse the space of *one* of the input
> tapes, by placing the output tape in the same LogicalTapeSet, but not all of
> them.
>
> We could enhance that, by using "filename + block number" instead of just
> block number, in the pointers in the logical tapes. Then you could spread
> one logical tape across multiple files. Probably not worth it in practice,
> though.

OK, so the options as I understand them are:

1. Enhance the logical tape set infrastructure in the manner you mention, to support filename (or more likely a proxy for filename) + block number in the logical tape pointers. Then, tapes can be transferred from one LogicalTapeSet to another.

2. Enhance the BufFile infrastructure to support some notion of a shared BufFile so that multiple processes can be reading and writing blocks in the same BufFile. Then, extend the logical tape infrastructure so that we also have the notion of a shared LogicalTape. This means that things like ltsGetFreeBlock() need to be re-engineered to handle concurrency with other backends.

3. Just live with the waste of space.

I would guess that (1) is easier than (2). Also, (2) might provoke contention while writing tapes that is otherwise completely unnecessary. It seems silly to have multiple backends fighting over the same end-of-file pointer for the same file when they could just write to different files instead.

Another tangentially-related problem I just realized is that we need to somehow handle the issues that tqueue.c does when transferring tuples between backends -- most of the time there's no problem, but if anonymous record types are involved then tuples require "remapping". It's probably harder to provoke a failure in the tuplesort case than with parallel query per se, but it's probably not impossible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 3. Just live with the waste of space.

I am loath to create a special case for the parallel interface too, but I think it's possible that *no* caller will ever actually need to live with this restriction at any time in the future. I am strongly convinced that adapting tuplesort.c for parallelism should involve partitioning [1]. With that approach, even randomAccess callers will not want to read at random from one big materialized tape, since that's at odds with the whole point of partitioning, which is to remove any dependencies between workers quickly and early, so that as much work as possible is pushed down into workers. If a merge join were performed in a world where we have this kind of partitioning, we definitely wouldn't require one big materialized tape that is accessible within each worker. What are the chances of any real user actually having to live with the waste of space at some point in the future?

> Another tangentially-related problem I just realized is that we need
> to somehow handle the issues that tqueue.c does when transferring
> tuples between backends -- most of the time there's no problem, but if
> anonymous record types are involved then tuples require "remapping".
> It's probably harder to provoke a failure in the tuplesort case than
> with parallel query per se, but it's probably not impossible.

Thanks for pointing that out. I'll look into it.

BTW, I discovered a bug: when very little memory is available within each worker, tuplesort.c throws an error in the workers immediately. It's just a matter of making sure that they at least have 64KB of workMem, which is a pretty straightforward fix. Obviously it makes no sense to use so little memory in the first place; this is a corner case.

[1] https://www.postgresql.org/message-id/CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com

--
Peter Geoghegan
On Wed, Dec 21, 2016 at 10:21 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 3. Just live with the waste of space.
>
> I am loath to create a special case for the parallel interface too,
> but I think it's possible that *no* caller will ever actually need to
> live with this restriction at any time in the future.

I just realized that you were actually talking about the waste of space in workers here, as opposed to the theoretical waste of space that would occur in the leader should there ever be a parallel randomAccess tuplesort caller. To be clear, I am totally against allowing a waste of logtape.c temp file space in *workers*, because that implies a cost that will most certainly be felt by users all the time.

--
Peter Geoghegan
On Tue, Dec 20, 2016 at 5:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Imagine a data structure that is stored in dynamic shared memory and
>> contains space for a filename, a reference count, and a mutex. Let's
>> call this thing a SharedTemporaryFile or something like that. It
>> offers these APIs:
>>
>> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
>> extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
>> extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
>> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);
>
> I'm a little bit tired right now, and I have yet to look at Thomas'
> parallel hash join patch in any detail. I'm interested in what you
> have to say here, but I think that I need to learn more about its
> requirements in order to have an informed opinion.

Attached is V7 of the patch. The overall emphasis with this revision is on bringing clarity on how much can be accomplished using generalized infrastructure, explaining the unification mechanism coherently, and related issues.

Notable changes
---------------

* Rebased to work with the newly simplified logtape.c representation (the recent removal of "indirect blocks" by Heikki). Heikki's work was something that helped with simplifying the whole unification mechanism, to a significant degree. I think that there was over a 50% reduction in logtape.c lines of code in this revision.

* randomAccess cases are now able to reclaim disk space from blocks originally written by workers. This further simplifies logtape.c changes significantly. I don't think that this is important because some future randomAccess caller might otherwise have double the storage overhead for their parallel sort, or even because of the disproportionate performance penalty such a caller would experience; rather, it's important because it removes previous special cases (that were internal to logtape.c). For example, aside from the fact that worker tapes within a unified tapeset will often have a non-zero offset, there is no state that actually remembers that this is a unified tapeset, because that isn't needed anymore. And, even though we reclaim blocks from workers, we only have one central chokepoint for applying worker offsets in the leader (that chokepoint is ltsReadFillBuffer()). Routines tasked with things like positional seeking for mark/restore for certain tuplesort clients (which are, in general, poorly tested) now need to have no knowledge of unification while still working just the same. This is a consequence of the fact that ltsWriteBlock() callers (and ltsWriteBlock() itself) never have to think about offsets. I'm pretty happy about that.

* pg_restore now prevents the planner from deciding that parallelism should be used, in order to make restoration behavior more consistent and predictable. Iff a dump being restored happens to have a CREATE INDEX with the new index storage parameter parallel_workers set, then pg_restore will use parallel CREATE INDEX. This is accomplished with a new GUC, enable_parallelddl (since "max_parallel_workers_maintenance = 0" will disable parallel CREATE INDEX across the board, ISTM that a second new GUC is required). I think that this behavior is the right trade-off as far as pg_restore goes, although I still don't feel particularly strongly about it. There is now a concrete proposal on what to do about pg_restore, if nothing else.
To recap, the general concern addressed here is that there are typically no ANALYZE stats available for the planner to base a decision on when pg_restore runs CREATE INDEX, although that isn't always true, which was both surprising and inconsistent.

* Addresses the problem of anonymous record types and their need for "remapping" across parallel workers. I've simply pushed the responsibility onto callers as part of the tuplesort.h contract; parallel CREATE INDEX callers don't need to care about this, as explained there. (CLUSTER tuplesorts would also be safe.)

* Puts the whole rationale for unification into one large comment above the function BufFileUnify(), and removes traces of the same kind of discussion from everywhere else. I think that buffile.c is the right central place to discuss the unification mechanism, now that logtape.c has been greatly simplified. All the fd.c changes are in routines that are only ever called by buffile.c anyway, and are not too complicated (in general, temp fd.c files are only ever owned transitively, through BufFiles). So, morally, the unification mechanism is something that wholly belongs to buffile.c, since unification is all about temp files, and buffile.h is the interface through which temp files are owned and accessed in general, without exception.

Unification remains specialized
-------------------------------

On the one hand, BufFileUnify() now describes the whole idea of unification in detail, in its own general terms, including its performance characteristics, but on the other hand it doesn't pretend to be more general than it is (that's why we really have to talk about performance characteristics). It doesn't go as far as admitting to being the thing that logtape.c uses for parallel sort, but even that doesn't seem totally unreasonable to me. I think that BufFileUnify() might also end up being used by tuplestore.c, so it isn't entirely non-general, but I now realize that it's unlikely to be used by parallel hash join. So, while randomAccess reclamation of worker blocks within the leader now occurs, I have not followed Robert's suggestion in full. For example, I didn't do this: "ltsGetFreeBlock() need to be re-engineered to handle concurrency with other backends". The more I've thought about it, the more appropriate the kind of specialization I've come up with seems. I've concluded:

- Sorting is important, and therefore worth adding non-general infrastructure in support of. It's important enough to have its own logtape.c module, so why not this? Much of buffile.c was explicitly written with sorting and hashing in mind from the beginning. We use BufFiles for other things, but those two things are by far the two most important users of temp files, and the only really compelling candidates for parallelization.

- There are limited opportunities to share BufFile infrastructure for parallel sorting and parallel hashing. Hashing is inverse to sorting conceptually, so it should not be surprising that this is the case. By that I mean that hashing is characterized by logical division and physical combination, whereas sorting is characterized by physical division and logical combination. Parallel tuplesort naturally allows each worker to do an enormous amount of work with whatever data it is fed by the parallel heap scan that it joins, *long* before the data needs to be combined with data from other workers in any way.
Consider this code from Thomas' parallel hash join patch:

> +bool
> +ExecHashCheckForEarlyExit(HashJoinTable hashtable)
> +{
> +	/*
> +	 * The golden rule of leader deadlock avoidance: since leader processes
> +	 * have two separate roles, namely reading from worker queues AND executing
> +	 * the same plan as workers, we must never allow a leader to wait for
> +	 * workers if there is any possibility those workers have emitted tuples.
> +	 * Otherwise we could get into a situation where a worker fills up its
> +	 * output tuple queue and begins waiting for the leader to read, while
> +	 * the leader is busy waiting for the worker.
> +	 *
> +	 * Parallel hash joins with shared tables are inherently susceptible to
> +	 * such deadlocks because there are points at which all participants must
> +	 * wait (you can't start checking for unmatched tuples in the hash table
> +	 * until probing has completed in all workers, etc).

Parallel sort will never have to do anything like this. There is minimal IPC before the leader's merge, and the dependencies between phases are extremely simple (there is only one; workers need to finish before leader can merge, and must stick around in a quiescent state throughout). Data throughput is what tuplesort cares about; it doesn't really care about latency. Whereas, I gather that there needs to be continual gossip between hash join workers (those building a hash table) about the number of batches. They don't have to be in perfect lockstep, but they need to cooperate closely; the IPC is pretty eager, and therefore latency sensitive. Thomas makes use of atomic ops in his patch, which makes sense, but I'd never bother with anything like that for parallel tuplesort; there'd be no measurable benefit there.

In general, it's not obvious to me that the SharedTemporaryFile() API that Robert sketched recently (or any very general shared file interface that does things like buffer writes in shared memory, uses a shared read pointer, etc) is right for either parallel hash join or parallel sort. I don't see that there is much to be said for a reference count mechanism for parallel sort BufFiles, since the dependencies are so simple and fixed, and for hash join, a much tighter mechanism seems desirable. I can't think why Thomas would want a shared read pointer, since the way he builds the shared hash table leaves it immutable once probing is underway; ISTM that he'll want that kind of mechanism to operate at a higher level, in a more specialized way. That said, I don't actually know what Thomas has in mind for multi-batch parallel hash joins, since that's only a TODO item in the most recent revision of his patch (maybe I missed something he wrote on this topic, though). Thomas is working on a revision that resolves that open item, at which point we'll know more. I understand that a new revision of his patch that closes out the TODO item isn't too far from being posted.

--
Peter Geoghegan
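As an aside, the "one central chokepoint" for worker offsets that the V7 notes above describe can be illustrated in a couple of lines of C; the type and field names here are simplified stand-ins, not the patch's actual definitions:

/*
 * All worker-offset arithmetic happens in the leader's read path; the
 * write path never sees an offset, which is why ltsWriteBlock() callers
 * need no knowledge of unification.
 */
typedef struct LogicalTapeSketch
{
	long	curBlockNumber;		/* tape-local block number */
	long	offsetBlockNumber;	/* where this worker's blocks begin in the
								 * unified BufFile space; 0 for ordinary
								 * serial tapes */
} LogicalTapeSketch;

/* Translate a tape-local block number for the next read. */
static long
ltsTranslateBlockSketch(const LogicalTapeSketch *lt)
{
	return lt->curBlockNumber + lt->offsetBlockNumber;
}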
On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached is V7 of the patch.

I am doing some testing. First, some superficial things from first pass:

Still applies with some offsets and one easy-to-fix rejected hunk in nbtree.c (removing some #include directives and a struct definition).

+/* Sort parallel code from state for sort__start probes */
+#define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
+								 (state)->workerNum >= 0 : 1 : 2)

Typo: ':' instead of '?'; the --enable-dtrace build fails.

+	 the entire utlity command, regardless of the number of

Typo: s/utlity/utility/

+	/* Perform sorting of spool, and possibly a spool2 */
+	sortmem = Max(maintenance_work_mem / btshared->scantuplesortstates, 64);

Just an observation: if you ask for a large number of workers, but only one can be launched, it will be constrained to a small fraction of maintenance_work_mem, but use only one worker. That's probably OK, and I don't see how to do anything about it unless you are prepared to make workers wait for an initial message from the leader to inform them how many were launched. Should this 64KB minimum be mentioned in the documentation?

+	if (!btspool->isunique)
+	{
+		shm_toc_estimate_keys(&pcxt->estimator, 2);
+	}

Project style: people always tell me to drop the curlies in cases like that. There are a few more examples in the patch.

+		/* Wait for workers */
+		ConditionVariableSleep(&shared->workersFinishedCv,
+							   WAIT_EVENT_PARALLEL_FINISH);

I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in tuplesort_leader_wait and worker_wait. That belongs to WaitForParallelWorkersToFinish, so someone who sees that in pg_stat_activity won't know which it is.

IIUC worker_wait() is only being used to keep the worker around so its files aren't deleted. Once buffile cleanup is changed to be ref-counted (in an on_dsm_detach hook?) then workers might as well exit sooner, freeing up a worker slot... do I have that right?

Incidentally, barrier.c could probably be used for this synchronisation instead of these functions. I think _bt_begin_parallel would call BarrierInit(&shared->barrier, scantuplesortstates) and then after LaunchParallelWorkers() it'd call a new interface BarrierDetachN(&shared->barrier, scantuplesortstates - pcxt->nworkers_launched) to forget about workers that failed to launch. Then you could use BarrierWait where the leader waits for the workers to finish, and BarrierDetach where the workers are finished and want to exit.

+	/* Prepare state to create unified tapeset */
+	leaderTapes = palloc(sizeof(TapeShare) * state->maxTapes);

Missing cast (TapeShare *) here? Project style, judging by code I've seen, and it avoids gratuitous C++ incompatibility.

+_bt_parallel_shared_estimate(Snapshot snapshot)
...
+tuplesort_estimate_shared(int nWorkers)

Inconsistent naming?

More soon.

--
Thomas Munro
http://www.enterprisedb.com
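(Presumably the intended macro, with the stray ':' corrected to '?', reads as below -- 0 for a serial sort, 1 for a worker, 2 for the leader, if the comment above it is anything to go by:)

#define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
								 (state)->workerNum >= 0 ? 1 : 2)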
On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Attached is V7 of the patch.
>
> I am doing some testing. First, some superficial things from first pass:
>
> [Various minor cosmetic issues]

Oops.

> Just an observation: if you ask for a large number of workers, but
> only one can be launched, it will be constrained to a small fraction
> of maintenance_work_mem, but use only one worker. That's probably OK,
> and I don't see how to do anything about it unless you are prepared to
> make workers wait for an initial message from the leader to inform
> them how many were launched.

Actually, the leader-owned worker Tuplesort state will have the appropriate amount, so you'd still need to have 2 participants (1 worker + leader-as-worker). And, sorting is much less sensitive to having a bit less memory than hashing (at least when there aren't dozens of runs to merge in the end, or multiple passes). So, I agree that this isn't worth worrying about for a DDL statement.

> Should this 64KB minimum be mentioned in the documentation?

You mean user-visible documentation, and not just tuplesort.h? I don't think that that's necessary. That's a ludicrously low amount of memory for a worker to be limited to anyway. It will never come up with remotely sensible use of the feature.

> +	if (!btspool->isunique)
> +	{
> +		shm_toc_estimate_keys(&pcxt->estimator, 2);
> +	}
>
> Project style: people always tell me to drop the curlies in cases like
> that. There are a few more examples in the patch.

I only do this when there is an "else" that must have curly braces, too. There are plenty of examples of this from existing code, so I think it's fine.

> +		/* Wait for workers */
> +		ConditionVariableSleep(&shared->workersFinishedCv,
> +							   WAIT_EVENT_PARALLEL_FINISH);
>
> I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in
> tuplesort_leader_wait and worker_wait. That belongs to
> WaitForParallelWorkersToFinish, so someone who sees that in
> pg_stat_activity won't know which it is.

Noted.

> IIUC worker_wait() is only being used to keep the worker around so its
> files aren't deleted. Once buffile cleanup is changed to be
> ref-counted (in an on_dsm_detach hook?) then workers might as well
> exit sooner, freeing up a worker slot... do I have that right?

Yes. Or at least I think it's very likely that that will end up happening.

> Incidentally, barrier.c could probably be used for this
> synchronisation instead of these functions. I think
> _bt_begin_parallel would call BarrierInit(&shared->barrier,
> scantuplesortstates) and then after LaunchParallelWorkers() it'd call
> a new interface BarrierDetachN(&shared->barrier, scantuplesortstates -
> pcxt->nworkers_launched) to forget about workers that failed to
> launch. Then you could use BarrierWait where the leader waits for the
> workers to finish, and BarrierDetach where the workers are finished
> and want to exit.

I thought about doing that, actually, but I don't like creating dependencies on some other uncommitted patch, which is a moving target (barrier stuff isn't committed yet). It makes life difficult for reviewers. I put off adopting condition variables until they were committed for the same reason -- it was easy to do without them for a time. I'll probably get around to it before too long, but feel no urgency about it. Barriers will only allow me to make a modest net removal of code, AFAIK.

Thanks

--
Peter Geoghegan
On Tue, Jan 31, 2017 at 12:15 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> Should this 64KB minimum be mentioned in the documentation? > > You mean user-visible documentation, and not just tuplesort.h? I don't > think that that's necessary. That's a ludicrously low amount of memory > for a worker to be limited to anyway. It will never come up with > remotely sensible use of the feature. I agree. >> + if (!btspool->isunique) >> + { >> + shm_toc_estimate_keys(&pcxt->estimator, 2); >> + } >> >> Project style: people always tell me to drop the curlies in cases like >> that. There are a few more examples in the patch. > > I only do this when there is an "else" that must have curly braces, > too. There are plenty of examples of this from existing code, so I > think it's fine. But I disagree on this one. I think if (blah) stuff(); else { thing(); gargle(); } ...is much better than if (blah) { stuff(); } else { thing(); gargle(); } But if there were a comment on a separate line before the call to stuff(), then I would do it the second way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> Attached is V7 of the patch. >> >> I am doing some testing. First, some superficial things from first pass: >> >> [Various minor cosmetic issues] > > Oops. As this review is very recent, I have moved the patch to CF 2017-03. -- Michael
On Wed, Feb 1, 2017 at 5:37 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>>> Attached is V7 of the patch.
>>>
>>> I am doing some testing. First, some superficial things from first pass:
>>>
>>> [Various minor cosmetic issues]
>>
>> Oops.
>
> As this review is very recent, I have moved the patch to CF 2017-03.

 ParallelContext *
-CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
+CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers,
+					  bool serializable_okay)
 {
 	MemoryContext oldcontext;
 	ParallelContext *pcxt;
@@ -143,7 +144,7 @@ CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
 	 * workers, at least not until somebody enhances that mechanism to be
 	 * parallel-aware.
 	 */
-	if (IsolationIsSerializable())
+	if (IsolationIsSerializable() && !serializable_okay)
 		nworkers = 0;

That's a bit weird but I can't think of a problem with it. Workers run with MySerializableXact == InvalidSerializableXact, even though they may have the snapshot of a SERIALIZABLE leader. Hopefully soon the restriction on SERIALIZABLE in parallel queries can be lifted anyway, and then this could be removed.

Here are some thoughts on the overall approach. Disclaimer: I haven't researched the state of the art in parallel sort or btree builds. But I gather from general reading that there are a couple of well known approaches, and I'm sure you'll correct me if I'm off base here.

1. All participants: parallel sequential scan, repartition on the fly so each worker has tuples in a non-overlapping range, sort, build disjoint btrees; barrier; leader: merge disjoint btrees into one.

2. All participants: parallel sequential scan, sort, spool to disk; barrier; leader: merge spooled tuples and build btree.

This patch is doing the 2nd thing. My understanding is that some systems might choose to do that if they don't have or don't like the table's statistics, since repartitioning for balanced load requires carefully chosen ranges and is highly sensitive to distribution problems.

It's pretty clear that approach 1 is a difficult project. From my research into dynamic repartitioning in the context of hash joins, I can see that that infrastructure is a significant project in its own right: subproblems include super efficient tuple exchange, buffering, statistics/planning and dealing with/adapting to bad outcomes. I also suspect that repartitioning operators might need to be specialised for different purposes like sorting vs hash joins, which may have differing goals. I think it's probably easy to build a slow dynamic repartitioning mechanism that frequently results in terrible worst case scenarios where you paid a fortune in IPC overheads and still finished up with one worker pulling most of the whole load. Without range partitioning, I don't believe you can merge the resulting non-disjoint btrees efficiently so you'd probably finish up writing a complete new btree to mash them together. As for merging disjoint btrees, I assume there are ways to do a structure-preserving merge that just rebuilds some internal pages and incorporates the existing leaf pages directly, a bit like tree manipulation in functional programming languages; that'll take some doing.
So I'm in favour of this patch, which is relatively simple and gives us faster index builds soon. Eventually we might also be able to have approach 1. From what I gather, it's entirely possible that we might still need 2 to fall back on in some cases.

Will you move the BufFile changes to a separate patch in the next revision?

Still testing and reviewing, more soon.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> 2. All participants: parallel sequential scan, sort, spool to disk;
> barrier; leader: merge spooled tuples and build btree.
>
> This patch is doing the 2nd thing. My understanding is that some
> systems might choose to do that if they don't have or don't like the
> table's statistics, since repartitioning for balanced load requires
> carefully chosen ranges and is highly sensitive to distribution
> problems.

The second thing here seems to offer comparable scalability to other systems' implementations of the first thing. They seem to have reused "partitioning to sort in parallel" for B-Tree builds, at least in some cases, despite this. WAL logging is the biggest serial bottleneck here for other systems, I've heard -- that's still going to be pretty much serial.

I think that the fact that some systems do partitioning for parallel B-Tree builds might have as much to do with their ability to create B-Tree indexes in place as anything else. Apparently, some systems don't use temp files, instead writing out what is for all intents and purposes part of a finished B-Tree as runs (no use of temp_tablespaces). That may be a big part of what makes it worthwhile to try to use partitioning. I understand that only the highest client counts will see much direct performance benefit relative to the first approach.

> It's pretty clear that approach 1 is a difficult project. From my
> research into dynamic repartitioning in the context of hash joins, I
> can see that that infrastructure is a significant project in its own
> right: subproblems include super efficient tuple exchange, buffering,
> statistics/planning and dealing with/adapting to bad outcomes. I also
> suspect that repartitioning operators might need to be specialised for
> different purposes like sorting vs hash joins, which may have
> differing goals. I think it's probably easy to build a slow dynamic
> repartitioning mechanism that frequently results in terrible worst
> case scenarios where you paid a fortune in IPC overheads and still
> finished up with one worker pulling most of the whole load. Without
> range partitioning, I don't believe you can merge the resulting
> non-disjoint btrees efficiently so you'd probably finish up writing a
> complete new btree to mash them together. As for merging disjoint
> btrees, I assume there are ways to do a structure-preserving merge
> that just rebuilds some internal pages and incorporates the existing
> leaf pages directly, a bit like tree manipulation in functional
> programming languages; that'll take some doing.

I agree with all that. "Stitching together" disjoint B-Trees does seem to have some particular risks, which users of other systems are cautioned against in their documentation. You can end up with an unbalanced B-Tree.

> So I'm in favour of this patch, which is relatively simple and gives us
> faster index builds soon. Eventually we might also be able to have
> approach 1. From what I gather, it's entirely possible that we might
> still need 2 to fall back on in some cases.

Right. And it can form the basis of an implementation of 1, which in any case seems to be much more compelling for parallel query, when a great deal more can be pushed down, and we are not particularly likely to be I/O bound (usually not much writing to the heap, or WAL logging).

> Will you move the BufFile changes to a separate patch in the next revision?

That is the plan.
I need to get set up with a new machine here, having given back my work laptop to Heroku, but it shouldn't take too long. Thanks for the review. -- Peter Geoghegan
On Wed, Feb 1, 2017 at 8:46 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> So I'm in favour of this patch, which is relatively simple and gives us
>> faster index builds soon. Eventually we might also be able to have
>> approach 1. From what I gather, it's entirely possible that we might
>> still need 2 to fall back on in some cases.
>
> Right. And it can form the basis of an implementation of 1, which in
> any case seems to be much more compelling for parallel query, when a
> great deal more can be pushed down, and we are not particularly likely
> to be I/O bound (usually not much writing to the heap, or WAL
> logging).

I ran some tests today. First I created test tables representing the permutations of these choices:

Table structure:

  int = Integer key only
  intwide = Integer key + wide row
  text = Text key only (using dictionary words)
  textwide = Text key + wide row

Uniqueness:

  u = each value unique
  d = 10 duplicates of each value

Heap physical order:

  rand = Random
  asc = Ascending order (already sorted)
  desc = Descending order (sorted backwards)

I used 10 million rows for this test run, so that gave me 24 tables of the following sizes as reported in "\d+":

  int tables = 346MB each
  intwide tables = 1817MB each
  text tables = 441MB each
  textwide tables = 1953MB each

It'd be interesting to test larger tables of course but I had a lot of permutations to get through. For each of those tables I ran tests corresponding to the permutations of these three variables:

Index type:

  uniq = CREATE UNIQUE INDEX ("u" tables only, ie no duplicates)
  nonu = CREATE INDEX ("u" and "d" tables)

Maintenance memory: 1MB, 64MB, 256MB, 512MB

Workers: from 0 up to 8

Environment: EDB test machine "cthulhu", Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz, 8 socket, 8 cores (16 threads) per socket, CentOS 7.2, Linux kernel 3.10.0-229.7.2.el7.x86_64, 512GB RAM, pgdata on SSD. Database initialised with en_US.utf-8 collation, all defaults except max_wal_size increased to 4GB (otherwise warnings about too frequent checkpoints) and max_parallel_workers_maintenance = 8. Testing done with warm OS cache.

I applied your v2 patch on top of 7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict. It still had a couple of harmless conflicts that I was able to deal with (not code, just some header stuff moving around).

See full results from all permutations attached, but I wanted to highlight the measurements from 'textwide', 'u', 'nonu' which show interesting 'asc' numbers (data already sorted). The 'mem' column is maintenance_work_mem in megabytes. The 'w = 0' column shows the time in seconds for parallel_workers = 0. The other 'w = N' columns show times with higher parallel_workers settings, represented as speed-up relative to the 'w = 0' time.

1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns show speed-up relative to that time):

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
  64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
 256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
 512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

2. 'rand' = randomised data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 130.25 | 1.82x | 2.19x | 2.52x | 2.58x | 2.72x | 2.72x | 2.83x | 2.89x
  64 | 117.36 | 1.80x | 2.20x | 2.43x | 2.47x | 2.55x | 2.51x | 2.59x | 2.69x
 256 | 124.68 | 1.87x | 2.20x | 2.49x | 2.52x | 2.64x | 2.70x | 2.72x | 2.75x
 512 | 115.77 | 1.51x | 1.72x | 2.14x | 2.08x | 2.19x | 2.31x | 2.44x | 2.48x

3. 'desc' = reverse-sorted data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 115.19 | 1.88x | 2.39x | 2.78x | 3.50x | 3.62x | 4.20x | 4.19x | 4.39x
  64 | 112.17 | 1.85x | 2.25x | 2.99x | 3.63x | 3.65x | 4.01x | 4.31x | 4.62x
 256 | 119.55 | 1.76x | 2.21x | 2.85x | 3.43x | 3.37x | 3.77x | 4.24x | 4.28x
 512 | 119.50 | 1.85x | 2.19x | 2.87x | 3.26x | 3.28x | 3.74x | 4.24x | 3.93x

The 'asc' effects are much less pronounced when the key is an int. Here is the equivalent data for 'intwide', 'u', 'nonu':

1. 'asc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 12.19 | 1.55x | 1.93x | 2.21x | 2.44x | 2.64x | 2.76x | 2.91x | 2.83x
  64 |  7.35 | 1.29x | 1.53x | 1.69x | 1.86x | 1.98x | 2.04x | 2.07x | 2.09x
 256 |  7.34 | 1.26x | 1.47x | 1.64x | 1.79x | 1.92x | 1.96x | 1.98x | 2.02x
 512 |  7.24 | 1.24x | 1.46x | 1.65x | 1.80x | 1.91x | 1.97x | 2.00x | 1.92x

2. 'rand'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 15.16 | 1.56x | 2.01x | 2.32x | 2.57x | 2.73x | 2.87x | 2.95x | 2.91x
  64 | 12.97 | 1.55x | 1.97x | 2.25x | 2.44x | 2.58x | 2.70x | 2.74x | 2.71x
 256 | 13.14 | 1.47x | 1.86x | 2.12x | 2.31x | 2.50x | 2.62x | 2.58x | 2.69x
 512 | 13.61 | 1.48x | 1.91x | 2.22x | 2.37x | 2.55x | 2.65x | 2.73x | 2.73x

3. 'desc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 13.45 | 1.51x | 1.94x | 2.31x | 2.56x | 2.75x | 2.95x | 3.05x | 3.00x
  64 | 10.27 | 1.42x | 1.82x | 2.05x | 2.30x | 2.46x | 2.59x | 2.64x | 2.65x
 256 | 10.52 | 1.39x | 1.70x | 2.02x | 2.24x | 2.34x | 2.39x | 2.48x | 2.56x
 512 | 10.62 | 1.43x | 1.82x | 2.06x | 2.32x | 2.51x | 2.61x | 2.68x | 2.69x

Full result summary and scripts used for testing attached.

--
Thomas Munro
http://www.enterprisedb.com
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I applied your v2 patch on top of
> 7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict. It
> still had a couple of harmless conflicts that I was able to deal with
> (not code, just some header stuff moving around).

You must mean my V7 patch. FWIW, I've resolved the conflicts with 7ac4a389a7dbddaa8b19deb228f0a988e79c5795 in my own private branch, and have worked through some of the open items that you raised.

> See full results from all permutations attached, but I wanted to
> highlight the measurements from 'textwide', 'u', 'nonu' which show
> interesting 'asc' numbers (data already sorted). The 'mem' column is
> maintenance_work_mem in megabytes. The 'w = 0' column shows the time
> in seconds for parallel_workers = 0. The other 'w = N' columns show
> times with higher parallel_workers settings, represented as speed-up
> relative to the 'w = 0' time.

The thing to keep in mind about testing presorted cases in tuplesort in general is that we have this weird precheck for presorted input in our qsort. This is something added by us to the original Bentley & McIlroy algorithm in 2006. I am very skeptical of this addition, in general. It tends to have the effect of highly distorting how effective most optimizations are for presorted cases, which comes up again and again. It only helps when the input is *perfectly* presorted, and a single out-of-order tuple at the end throws away all the work done up to that point (which wouldn't be so bad if the main cost were comparisons rather than memory accesses, but that isn't the case).

Your baseline case can either be made unrealistically fast due to the fact that you get a perfectly sympathetic case for this optimization, or unrealistically slow (very CPU bound) due to the fact that you have that one last tuple out of place. I once said that this last tuple can act like a discarded banana skin.

There is nothing wrong with the idea of exploiting presortedness, and to some extent the original algorithm does that (by using insertion sort), but an optimization along the lines of Timsort's "galloping mode" (which is what this modification of ours attempts) requires non-trivial bookkeeping to do right.

--
Peter Geoghegan
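For readers who haven't seen it, the precheck in question looks roughly like this (simplified into a standalone function for illustration; in the real qsort it is inline in the sort routine):

#include <stdbool.h>
#include <stddef.h>

/*
 * One comparison per element when the input is perfectly sorted, in
 * which case the caller can skip sorting entirely.  But a single
 * out-of-order element -- the "banana skin" -- means every comparison
 * made up to that point was wasted, and the full sort starts from
 * scratch.
 */
static bool
input_is_presorted(char *a, size_t n, size_t es,
				   int (*cmp) (const void *, const void *))
{
	char	   *pm;

	for (pm = a + es; pm < a + n * es; pm += es)
	{
		if (cmp(pm - es, pm) > 0)
			return false;
	}
	return true;
}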
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
> show speed-up relative to that time):
>
>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>    1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

I think that this presorted case doesn't improve much because the sorting itself is so cheap, as explained in my last mail. However, the improvements as workers are added are still smaller than expected. I think that this indicates that there isn't enough I/O capacity available here to truly show the full potential of the patch -- I've certainly seen better scalability for cases like this when there is a lot of I/O bandwidth available, and I/O parallelism is there to be taken advantage of. Say, when using a system with a large RAID array (I used a RAID0 array with 12 HDDs for my own tests). Another issue is that you probably don't have enough data here to really show off the patch. I don't want to dismiss the benchmark, which is still quite informative, but it's worth pointing out that the feature is going to be most compelling for very large indexes, that will take at least several minutes to build under any circumstances. (Having a reproducible case is also important, and that is something your tests have going for them, on the other hand.)

I suspect that this system isn't particularly well balanced for the task of benchmarking the patch. You would probably see notably better scalability than any you've shown in any test if you could add additional sequential I/O bandwidth, which is probably an economical, practical choice for many users. I suspect that you aren't actually saturating available CPUs to the greatest extent that the implementation makes possible.

Another thing I want to point out is that with 1MB of maintenance_work_mem, the patch appears to do very well, but that isn't terribly meaningful. I would suggest that we avoid testing this patch with such a low amount of memory -- it doesn't seem important. This is skewed by the fact that you're using replacement selection in the serial case only. I think what this actually demonstrates is that replacement selection is very slow, even with its putative best case. I believe that commit 2459833 was the final nail in the coffin of replacement selection. I certainly don't want to relitigate the discussion on replacement_sort_tuples, and am not going to push too hard, but ISTM that we should fully remove replacement selection from tuplesort.c and be done with it.

--
Peter Geoghegan
On Sat, Feb 4, 2017 at 11:58 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
>> show speed-up relative to that time):
>>
>>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
>> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>>    1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x
>
> I think that this presorted case doesn't improve much because the
> sorting itself is so cheap, as explained in my last mail. However, the
> improvements as workers are added are still smaller than expected. I
> think that this indicates that there isn't enough I/O capacity
> available here to truly show the full potential of the patch -- I've
> certainly seen better scalability for cases like this when there is a
> lot of I/O bandwidth available, and I/O parallelism is there to be
> taken advantage of. Say, when using a system with a large RAID array
> (I used a RAID0 array with 12 HDDs for my own tests). Another issue is
> that you probably don't have enough data here to really show off the
> patch. I don't want to dismiss the benchmark, which is still quite
> informative, but it's worth pointing out that the feature is going to
> be most compelling for very large indexes, that will take at least
> several minutes to build under any circumstances. (Having a
> reproducible case is also important, and that is something your tests
> have going for them, on the other hand.)

Right. My main reason for starting smallish was to allow me to search a space with several variables without waiting eons. Next I would like to run a small subset of those tests with, say, 10, 20 or even 100 times more data loaded, so the tables would be ~20GB, ~40GB or ~200GB.

About read bandwidth: It shouldn't have been touching the disk at all for reads: I did a dummy run of the index build before the measured runs, so that a 2GB table being sorted in ~2 minutes would certainly have come entirely from the OS page cache since the machine has oodles of RAM.

About write bandwidth: The WAL, the index and the temp files all went to an SSD array, though I don't have the characteristics of that to hand. I should also be able to test on a multi-spindle HDD array. I doubt either can touch your 12-way RAID0 array, but will look into that.

> I suspect that this system isn't particularly well balanced for the
> task of benchmarking the patch. You would probably see notably better
> scalability than any you've shown in any test if you could add
> additional sequential I/O bandwidth, which is probably an economical,
> practical choice for many users. I suspect that you aren't actually
> saturating available CPUs to the greatest extent that the
> implementation makes possible.

I will look into what IO options I can access before running larger tests. Also I will look into running the test with both cold and warm caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth enters the picture.

> Another thing I want to point out is that with 1MB of
> maintenance_work_mem, the patch appears to do very well, but that
> isn't terribly meaningful.
> I would suggest that we avoid testing this
> patch with such a low amount of memory -- it doesn't seem important.
> This is skewed by the fact that you're using replacement selection in
> the serial case only. I think what this actually demonstrates is that
> replacement selection is very slow, even with its putative best case.
> I believe that commit 2459833 was the final nail in the coffin of
> replacement selection. I certainly don't want to relitigate the
> discussion on replacement_sort_tuples, and am not going to push too
> hard, but ISTM that we should fully remove replacement selection from
> tuplesort.c and be done with it.

Interesting. I haven't grokked this but will go and read about it.

Based on your earlier comments about banana skin effects, I'm wondering if it would be interesting to add a couple more heap distributions to the test set that are almost completely sorted except for a few entries out of order.

--
Thomas Munro
http://www.enterprisedb.com
On Fri, Feb 3, 2017 at 4:15 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> I suspect that this system isn't particularly well balanced for the
>> task of benchmarking the patch. You would probably see notably better
>> scalability than any you've shown in any test if you could add
>> additional sequential I/O bandwidth, which is probably an economical,
>> practical choice for many users. I suspect that you aren't actually
>> saturating available CPUs to the greatest extent that the
>> implementation makes possible.
>
> I will look into what IO options I can access before running larger
> tests. Also I will look into running the test with both cold and warm
> caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth
> enters the picture.

It might just have been that the table was too small to be an effective target for parallel sequential scan with so many workers, and so a presorted best case CREATE INDEX, which isn't that different, also fails to see much benefit (compared to what you'd see with a similar case involving a larger table). In other words, I might have jumped the gun in emphasizing issues with hardware and I/O bandwidth over issues around data volume (that I/O parallelism is inherently not very helpful with these relatively small tables).

As I've pointed out a couple of times before, bigger sorts will be more CPU bound because sorting itself has costs that grow linearithmically, whereas writing out runs has costs that grow linearly. The relative cost of the I/O can be expected to go down as input goes up for this reason. At the same time, a larger input might make better use of I/O parallelism, which reduces the cost paid in latency to write out runs in absolute terms.

--
Peter Geoghegan
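To put rough numbers on that scaling argument (illustrative arithmetic only): a comparison sort does on the order of n * log2(n) comparisons, so growing the input from 10 million tuples (log2(n) ~= 23) to 1 billion tuples (log2(n) ~= 30) multiplies the comparison work by about 100 * (30 / 23), or roughly 130x, while the volume of run data written out grows by only 100x. The bigger the sort, the smaller the share of the total cost that writing runs can represent.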
On Mon, Jan 30, 2017 at 9:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> IIUC worker_wait() is only being used to keep the worker around so its
>> files aren't deleted. Once buffile cleanup is changed to be
>> ref-counted (in an on_dsm_detach hook?) then workers might as well
>> exit sooner, freeing up a worker slot... do I have that right?
>
> Yes. Or at least I think it's very likely that that will end up happening.

I've looked into this, and have a version of the patch where clean-up occurs when the last backend with a reference to the BufFile goes away. It seems robust; all of my private tests pass, including tests of things that parallel CREATE INDEX won't use but that are added as infrastructure anyway (e.g., randomAccess recycling of blocks by the leader from workers). As Thomas anticipated, worker_wait() now only makes workers wait until the leader comes along to take a reference to their files, at which point the worker processes can go away. In effect, the worker processes go away as soon as possible, just as the leader begins its final on-the-fly merge. At that point, they could be reused by some other process, of course.

However, there are some specific implementation issues with this that I didn't quite anticipate. I would like to get feedback on these issues now, from both Thomas and Robert. The issues relate to how much the patch can or should "buy into resource management". You might guess that this new resource management code is something that should live in fd.c, alongside the guts of temp file resource management, within the function FileClose(). That way, it would be called by every possible path that might delete a temp file, including ResourceOwnerReleaseInternal(). That's not what I've done, though. Instead, refcount management is limited to a few higher level routines in buffile.c. Initially, resource management in FileClose() is made to assume that it must delete the file. Then, if and when directed to by BufFileClose()/refcount, a backend may determine that it is not its job to do the deletion -- it will not be the one that must "turn out the lights", and so indicates to FileClose() that it should not delete the file after all (it should just release vFDs, close(), and so on). Otherwise, when refcount reaches zero, temp files are deleted by FileClose() in more or less the conventional manner.

The fact that there could, in general, be any error that causes us to attempt a double-deletion (deletion of a temp file from more than one backend) for a time is less of a problem than you might think. This is because there is a risk of this only for as long as two backends hold open the file at the same time. In the case of parallel CREATE INDEX, this is now the shortest possible period of time, since workers close their files using BufFileClose() immediately after the leader wakes them up from a quiescent state. And, if that were to actually happen, say due to some random OOM error during that small window, the consequence is no worse than an annoying log message: "could not unlink file..." (this would come from the second backend that attempted an unlink()). You would not see this when a worker raised an error due to a duplicate violation, or any other routine problem, so it should really be almost impossible.

That having been said, this probably *is* a problematic restriction in cases where a temp file's ownership is not immediately handed over without concurrent sharing.
What happens to be a small window for the parallel CREATE INDEX patch probably wouldn't be a small window for parallel hash join. :-( It's not hard to see why I would like to do things this way. Just look at ResourceOwnerReleaseInternal(). Any release of a file happens during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic shared memory segments happens earlier, during RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to implement a refcount is using dynamic shared memory, and that seems hard. There are additional reasons why I suggest we go this way, such as the fact that all the relevant state belongs to BufFile, which is implemented a layer above all of the guts of resource management of temp files within fd.c. I'd have to replicate almost all state in fd.c to make it all work, which seems like a big modularity violation. Does anyone have any suggestions on how to tackle this? -- Peter Geoghegan
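To make the hand-off described above concrete, here is a minimal sketch of its shape. All names are invented for illustration -- in particular BufFileSetDeleteOnClose() is a hypothetical entry point, not anything in the patch or in today's buffile.c/fd.c:

/*
 * Minimal sketch of the refcount hand-off described above; invented
 * names, not actual patch code.  The state struct lives in the DSM
 * segment, so every co-owning backend sees the same refcount.
 */
typedef struct SharedBufFileState
{
    slock_t     mutex;
    int         refcount;       /* backends that still hold the file open */
} SharedBufFileState;

static void
BufFileCloseShared(BufFile *file, SharedBufFileState *state)
{
    bool        last_owner;

    SpinLockAcquire(&state->mutex);
    last_owner = (--state->refcount == 0);
    SpinLockRelease(&state->mutex);

    /*
     * If we are not the one that must "turn out the lights", tell fd.c
     * not to delete the underlying temp files after all: just release
     * vFDs and close().  Otherwise, closing deletes them as usual.
     */
    BufFileSetDeleteOnClose(file, last_owner);  /* hypothetical */
    BufFileClose(file);
}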
On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> However, there are some specific implementation issues with this that
> I didn't quite anticipate. I would like to get feedback on these
> issues now, from both Thomas and Robert. The issues relate to how much
> the patch can or should "buy into resource management". You might
> guess that this new resource management code is something that should
> live in fd.c, alongside the guts of temp file resource management,
> within the function FileClose(). That way, it would be called by every
> possible path that might delete a temp file, including
> ResourceOwnerReleaseInternal(). That's not what I've done, though.
> Instead, refcount management is limited to a few higher level routines
> in buffile.c. Initially, resource management in FileClose() is made to
> assume that it must delete the file. Then, if and when directed to by
> BufFileClose()/refcount, a backend may determine that it is not its
> job to do the deletion -- it will not be the one that must "turn out
> the lights", and so indicates to FileClose() that it should not delete
> the file after all (it should just release vFDs, close(), and so on).
> Otherwise, when refcount reaches zero, temp files are deleted by
> FileClose() in more or less the conventional manner.
>
> The fact that there could, in general, be any error that causes us to
> attempt a double-deletion (deletion of a temp file from more than one
> backend) for a time is less of a problem than you might think. This is
> because there is a risk of this only for as long as two backends hold
> open the file at the same time. In the case of parallel CREATE INDEX,
> this is now the shortest possible period of time, since workers close
> their files using BufFileClose() immediately after the leader wakes
> them up from a quiescent state. And, if that were to actually happen,
> say due to some random OOM error during that small window, the
> consequence is no worse than an annoying log message: "could not
> unlink file..." (this would come from the second backend that
> attempted an unlink()). You would not see this when a worker raised an
> error due to a duplicate violation, or any other routine problem, so
> it should really be almost impossible.
>
> That having been said, this probably *is* a problematic restriction in
> cases where a temp file's ownership is not immediately handed over
> without concurrent sharing. What happens to be a small window for the
> parallel CREATE INDEX patch probably wouldn't be a small window for
> parallel hash join. :-(
>
> It's not hard to see why I would like to do things this way. Just look
> at ResourceOwnerReleaseInternal(). Any release of a file happens
> during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic
> shared memory segments happens earlier, during
> RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to
> implement a refcount is using dynamic shared memory, and that seems
> hard. There are additional reasons why I suggest we go this way, such
> as the fact that all the relevant state belongs to BufFile, which is
> implemented a layer above all of the guts of resource management of
> temp files within fd.c. I'd have to replicate almost all state in fd.c
> to make it all work, which seems like a big modularity violation.
>
> Does anyone have any suggestions on how to tackle this?

Hmm. One approach might be like this:
1. There is a shared refcount which is incremented when you open a shared file and decremented if you optionally explicitly 'release' it. (Not when you close it, because we can't allow code that may be run during RESOURCE_RELEASE_AFTER_LOCKS to try to access the DSM segment after it has been unmapped; more generally, creating destruction order dependencies between different kinds of resource-manager-cleaned-up objects seems like a bad idea. Of course the close code still looks after closing the vfds in the local backend.)

2. If you want to hand the file over to some other process and exit, you probably want to avoid race conditions or extra IPC burden. To achieve that you could 'pin' the file, so that it survives even while not open in any backend.

3. If the refcount reaches zero when you 'release' and the file isn't 'pinned', then you must delete the underlying files.

4. When the DSM segment is detached, we spin through all associated shared files that we're still 'attached' to (ie opened but didn't release) and decrement the refcount. If any shared file's refcount reaches zero its files should be deleted, even if it was 'pinned'.

In other words, the associated DSM segment's lifetime is the maximum lifetime of shared files, but it can be shorter if you 'release' in all backends and don't 'pin'. It's up to client code to come up with some scheme to make that work, if it doesn't take the easy route of pinning until DSM segment destruction.

I think in your case you'd simply pin all the BufFiles, allowing workers to exit when they're done; the leader would wait for all workers to indicate they'd finished, and then open the files. The files would be deleted eventually when the last process detaches from the DSM segment (very likely the leader). In my case I'd pin all shared BufFiles and then release them when I'd finished reading them back in and didn't need them anymore, and unpin them in the first participant to discover that the end had been reached (it would be a programming error to pin twice or unpin twice, like similarly named operations for DSM segments and DSA areas). That'd preserve the existing Hash Join behaviour of deleting batch files as soon as possible, but also guarantee cleanup in any error case.

There is something a bit unpleasant about teaching other subsystems about the existence of DSM segments just to be able to use DSM lifetime as a cleanup scope. I do think dsm_on_detach is a pretty good place to do cleanup of resources in parallel computing cases like ours, but I wonder if we could introduce a more generic destructor callback interface which DSM segments could provide.

--
Thomas Munro
http://www.enterprisedb.com
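As an interface, the open/release/pin protocol just sketched might look something like the following. Every name here is invented for illustration (SharedFileRegistry in particular); no such API exists:

/* Hypothetical API for the open/release/pin scheme described above. */
typedef struct SharedFileRegistry SharedFileRegistry;   /* lives in DSM */

/* Increments the shared refcount for this file. */
extern BufFile *shared_file_open(SharedFileRegistry *reg, int fileno);

/* Decrements the refcount; deletes underlying files at zero, unless pinned. */
extern void shared_file_release(SharedFileRegistry *reg, BufFile *file);

/* Keep the file alive even while no backend has it open, and undo that. */
extern void shared_file_pin(SharedFileRegistry *reg, int fileno);
extern void shared_file_unpin(SharedFileRegistry *reg, int fileno);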
On Wed, Feb 8, 2017 at 8:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Does anyone have any suggestions on how to tackle this?
>
> Hmm. One approach might be like this:
>
> [hand-wavy stuff]

Thinking a bit harder about this, I suppose there could be a kind of object called a SharedBufFileManager (insert better name) which you can store in a DSM segment. The leader backend that initialises a DSM segment containing one of these would then call a constructor function that sets an internal refcount to 1 and registers an on_dsm_detach callback for its on-detach function. All worker backends that attach to the DSM segment would need to call an attach function for the SharedBufFileManager to increment a refcount and also register the on_dsm_detach callback, before any chance that an error might be thrown (is that difficult?); failure to do so could result in file leaks.

Then, when a BufFile is to be shared (AKA exported, made unifiable), a SharedBufFile object can be initialised somewhere in the same DSM segment and registered with the SharedBufFileManager. Internally all registered SharedBufFile objects would be linked together using offsets from the start of the DSM segment for link pointers. Now when SharedBufFileManager's on-detach function runs, it decrements the refcount in the SharedBufFileManager, and if that reaches zero then it runs a destructor that spins through the list of SharedBufFile objects deleting files that haven't already been deleted explicitly.

I retract the pin/unpin and per-file refcounting stuff I mentioned earlier. You could make the default that all files registered with a SharedBufFileManager survive until the containing DSM segment is detached everywhere using that single refcount in the SharedBufFileManager object, but also provide a 'no really delete this particular shared file now' operation for client code that knows it's safe to do that sooner (which would be the case for me, I think). I don't think per-file refcounts are needed.

There are a couple of problems with the above though. Firstly, doing reference counting in DSM segment on-detach hooks is really a way to figure out when the DSM segment is about to be destroyed by keeping a separate refcount in sync with the DSM segment's refcount, but it doesn't account for pinned DSM segments. It's not your use-case or mine currently, but someone might want a DSM segment to live even when it's not attached anywhere, to be reattached later. If we're trying to use DSM segment lifetime as a scope, we'd be ignoring this detail. Perhaps instead of adding our own refcount we need a new kind of hook, on_dsm_destroy.

Secondly, I might not want to be constrained by a fixed-sized DSM segment to hold my SharedBufFile objects... there are cases where I need to share a number of batch files that is unknown at the start of execution time, when the DSM segment is sized (I'll write about that shortly on the Parallel Shared Hash thread). Maybe I can find a way to get rid of that requirement. Or maybe it could support DSA memory too, but I don't think it's possible to use on_dsm_detach-based cleanup routines that refer to DSA memory because by the time any given DSM segment's detach hook runs, there's no telling which other DSM segments have been detached already, so the DSA area may already have partially vanished; some other kind of hook that runs earlier would be needed... Hmm.

--
Thomas Munro
http://www.enterprisedb.com
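A rough sketch of the shape just described might be as follows. Every name and field here is a placeholder invented for illustration, not patch code; only on_dsm_detach(), dsm_segment_address() and the spinlock primitives are existing infrastructure:

/* Placeholder sketch of the SharedBufFileManager idea above. */
typedef struct SharedBufFile
{
    Size        next;           /* segment offset of next entry, or 0 */
    pid_t       creator_pid;    /* with fileno, identifies the temp file */
    int         fileno;
    bool        deleted;        /* already unlinked explicitly? */
} SharedBufFile;

typedef struct SharedBufFileManager
{
    slock_t     mutex;
    int         refcount;       /* attached backends */
    Size        first;          /* segment offset of list head, or 0 */
} SharedBufFileManager;

/* Registered via on_dsm_detach() by every backend that attaches. */
static void
shared_buf_file_manager_on_detach(dsm_segment *seg, Datum arg)
{
    SharedBufFileManager *mgr = (SharedBufFileManager *) DatumGetPointer(arg);
    bool        last;

    SpinLockAcquire(&mgr->mutex);
    last = (--mgr->refcount == 0);
    SpinLockRelease(&mgr->mutex);

    if (last)
    {
        /* Last to detach: walk the offset-linked list and clean up. */
        char       *base = (char *) dsm_segment_address(seg);
        Size        off = mgr->first;

        while (off != 0)
        {
            SharedBufFile *file = (SharedBufFile *) (base + off);

            if (!file->deleted)
            {
                /* unlink file's on-disk segments here */
            }
            off = file->next;
        }
    }
}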
On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Thinking a bit harder about this, I suppose there could be a kind of > object called a SharedBufFileManager (insert better name) which you > can store in a DSM segment. The leader backend that initialises a DSM > segment containing one of these would then call a constructor function > that sets an internal refcount to 1 and registers an on_dsm_detach > callback for its on-detach function. All worker backends that attach > to the DSM segment would need to call an attach function for the > SharedBufFileManager to increment a refcount and also register the > on_dsm_detach callback, before any chance that an error might be > thrown (is that difficult?); failure to do so could result in file > leaks. Then, when a BufFile is to be shared (AKA exported, made > unifiable), a SharedBufFile object can be initialised somewhere in the > same DSM segment and registered with the SharedBufFileManager. > Internally all registered SharedBufFile objects would be linked > together using offsets from the start of the DSM segment for link > pointers. Now when SharedBufFileManager's on-detach function runs, it > decrements the refcount in the SharedBufFileManager, and if that > reaches zero then it runs a destructor that spins through the list of > SharedBufFile objects deleting files that haven't already been deleted > explicitly. I think this is approximately reasonable, but I think it could be made simpler by having fewer separate objects. Let's assume the leader can put an upper bound on the number of shared BufFiles at the time it's sizing the DSM segment (i.e. before InitializeParallelDSM). Then it can allocate a big ol' array with a header indicating the array size and each element containing enough space to identify the relevant details of 1 shared BufFile. Now you don't need to do any allocations later on, and you don't need a linked list. You just loop over the array and do what needs doing. > There are a couple of problems with the above though. Firstly, doing > reference counting in DSM segment on-detach hooks is really a way to > figure out when the DSM segment is about to be destroyed by keeping a > separate refcount in sync with the DSM segment's refcount, but it > doesn't account for pinned DSM segments. It's not your use-case or > mine currently, but someone might want a DSM segment to live even when > it's not attached anywhere, to be reattached later. If we're trying > to use DSM segment lifetime as a scope, we'd be ignoring this detail. > Perhaps instead of adding our own refcount we need a new kind of hook > on_dsm_destroy. I think it's good enough to plan for current needs now. It's not impossible to change this stuff later, but we need something that works robustly right now without being too invasive. Inventing whole new system concepts because of stuff we might someday want to do isn't a good idea because we may easily guess wrong about what direction we'll want to go in the future. This is more like building a wrench than a 747: a 747 needs to be extensible and reconfigurable and upgradable because it costs $350 million. A wrench costs $10 at Walmart and if it turns out we bought the wrong one, we can just throw it out and get a different one later. > Secondly, I might not want to be constrained by a > fixed-sized DSM segment to hold my SharedBufFile objects... 
there are > cases where I need to shared a number of batch files that is unknown > at the start of execution time when the DSM segment is sized (I'll > write about that shortly on the Parallel Shared Hash thread). Maybe I > can find a way to get rid of that requirement. Or maybe it could > support DSA memory too, but I don't think it's possible to use > on_dsm_detach-based cleanup routines that refer to DSA memory because > by the time any given DSM segment's detach hook runs, there's no > telling which other DSM segments have been detached already, so the > DSA area may already have partially vanished; some other kind of hook > that runs earlier would be needed... Again, wrench. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
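For concreteness, the fixed-size array Robert sketches above might look like this -- all names invented here, with the leader sizing it before InitializeParallelDSM:

/* Invented sketch of the fixed-size array suggestion; not patch code. */
typedef struct SharedBufFileSlot
{
    bool        in_use;
    bool        deleted;        /* already unlinked explicitly? */
    pid_t       creator_pid;    /* with fileno, enough to rebuild the path */
    int         fileno;
} SharedBufFileSlot;

typedef struct SharedBufFileArray
{
    slock_t     mutex;
    int         refcount;       /* attached backends */
    int         nslots;         /* fixed when the DSM segment is sized */
    SharedBufFileSlot slots[FLEXIBLE_ARRAY_MEMBER];
} SharedBufFileArray;

/* Called by the leader while estimating DSM size. */
static Size
shared_buf_file_array_size(int nslots)
{
    return add_size(offsetof(SharedBufFileArray, slots),
                    mul_size(nslots, sizeof(SharedBufFileSlot)));
}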
On Fri, Feb 10, 2017 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> Thinking a bit harder about this, I suppose there could be a kind of >> object called a SharedBufFileManager [... description of that ...]. > > I think this is approximately reasonable, but I think it could be made > simpler by having fewer separate objects. Let's assume the leader can > put an upper bound on the number of shared BufFiles at the time it's > sizing the DSM segment (i.e. before InitializeParallelDSM). Then it > can allocate a big ol' array with a header indicating the array size > and each element containing enough space to identify the relevant > details of 1 shared BufFile. Now you don't need to do any allocations > later on, and you don't need a linked list. You just loop over the > array and do what needs doing. Makes sense. >> There are a couple of problems with the above though. Firstly, doing >> reference counting in DSM segment on-detach hooks is really a way to >> figure out when the DSM segment is about to be destroyed by keeping a >> separate refcount in sync with the DSM segment's refcount, but it >> doesn't account for pinned DSM segments. It's not your use-case or >> mine currently, but someone might want a DSM segment to live even when >> it's not attached anywhere, to be reattached later. If we're trying >> to use DSM segment lifetime as a scope, we'd be ignoring this detail. >> Perhaps instead of adding our own refcount we need a new kind of hook >> on_dsm_destroy. > > I think it's good enough to plan for current needs now. It's not > impossible to change this stuff later, but we need something that > works robustly right now without being too invasive. Inventing whole > new system concepts because of stuff we might someday want to do isn't > a good idea because we may easily guess wrong about what direction > we'll want to go in the future. This is more like building a wrench > than a 747: a 747 needs to be extensible and reconfigurable and > upgradable because it costs $350 million. A wrench costs $10 at > Walmart and if it turns out we bought the wrong one, we can just throw > it out and get a different one later. I agree that the pinned segment case doesn't matter right now, I just wanted to point it out. I like your $10 wrench analogy, but maybe it could be argued that adding a dsm_on_destroy() callback mechanism is not only better than adding another refcount to track that other refcount, but also a steal at only $8. >> Secondly, I might not want to be constrained by a >> fixed-sized DSM segment to hold my SharedBufFile objects... there are >> cases where I need to shared a number of batch files that is unknown >> at the start of execution time when the DSM segment is sized (I'll >> write about that shortly on the Parallel Shared Hash thread). Maybe I >> can find a way to get rid of that requirement. Or maybe it could >> support DSA memory too, but I don't think it's possible to use >> on_dsm_detach-based cleanup routines that refer to DSA memory because >> by the time any given DSM segment's detach hook runs, there's no >> telling which other DSM segments have been detached already, so the >> DSA area may already have partially vanished; some other kind of hook >> that runs earlier would be needed... > > Again, wrench. My problem here is that I don't know how many batches I'll finish up creating. 
In general that's OK because I can hold onto them as private BufFiles owned by participants with the existing cleanup mechanism, and then share them just before they need to be shared (ie when we switch to processing the next batch so they need to be readable by all). Now I only ever share one inner and one outer batch file per participant at a time, and then I explicitly delete them at a time that I know to be safe and before I need to share a new file that would involve recycling the slot, and I'm relying on DSM segment scope cleanup only to handle error paths. That means that in general I only need space for 2 * P shared BufFiles at a time. But there is a problem case: when the leader needs to exit early, it needs to be able to transfer ownership of any files it has created, which could be more than we planned for, and then not participate any further in the hash join, so it can't participate in the on-demand sharing scheme.

Perhaps we can find a way to describe a variable number of BufFiles (ie batches) in a fixed space by constructing the filenames in a way that only requires us to say how many there are. Then the next problem is that for each BufFile we have to know how many 1GB segments there are to unlink (files named foo, foo.1, foo.2, ...), which Peter's code currently captures by publishing the file size in the descriptor... but if a fixed size object must describe N BufFiles, where can I put the size of each one? Maybe I could put it in a header of the file itself (yuck!), or maybe I could decide that I don't care what the size is, I'll simply unlink "foo", then "foo.1", then "foo.2", ... until I get ENOENT.

Alternatively I might get rid of the requirement for the leader to drop out of processing later batches. I'm about to post a message to the other thread about how to do that, but it's complicated and I'm currently working on the assumption that the PSH patch is useful without it (but let's not discuss that in this thread). That would have the side effect of getting rid of the requirement to share a number of BufFiles that isn't known up front.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I agree that the pinned segment case doesn't matter right now, I just > wanted to point it out. I like your $10 wrench analogy, but maybe it > could be argued that adding a dsm_on_destroy() callback mechanism is > not only better than adding another refcount to track that other > refcount, but also a steal at only $8. If it's that simple, it might be worth doing, but I bet it's not. One problem is that there's a race condition: there will inevitably be a period of time after you've called dsm_attach() and before you've attached to the specific data structure that we're talking about here. So suppose the last guy who actually knows about this data structure dies horribly and doesn't clean up because the DSM isn't being destroyed; moments later, you die horribly before reaching the code where you attach to this data structure. Oops. You might think about plugging that hole by moving the registry of on-destroy functions into the segment itself and making it a shared resource. But ASLR breaks that, especially for loadable modules. You could try to fix that problem, in turn, by storing arguments that can later be passed to load_external_function() instead of a function pointer per se. But that sounds pretty fragile because some other backend might not try to load the module until after it's attached the DSM segment and it might then fail because loading the module runs _PG_init() which can throw errors. Maybe you can think of a way to plug that hole too but you're waaaaay over your $8 budget by this point. >>> Secondly, I might not want to be constrained by a >>> fixed-sized DSM segment to hold my SharedBufFile objects... there are >>> cases where I need to shared a number of batch files that is unknown >>> at the start of execution time when the DSM segment is sized (I'll >>> write about that shortly on the Parallel Shared Hash thread). Maybe I >>> can find a way to get rid of that requirement. Or maybe it could >>> support DSA memory too, but I don't think it's possible to use >>> on_dsm_detach-based cleanup routines that refer to DSA memory because >>> by the time any given DSM segment's detach hook runs, there's no >>> telling which other DSM segments have been detached already, so the >>> DSA area may already have partially vanished; some other kind of hook >>> that runs earlier would be needed... >> >> Again, wrench. > > My problem here is that I don't know how many batches I'll finish up > creating. In general that's OK because I can hold onto them as > private BufFiles owned by participants with the existing cleanup > mechanism, and then share them just before they need to be shared (ie > when we switch to processing the next batch so they need to be > readable by all). Now I only ever share one inner and one outer batch > file per participant at a time, and then I explicitly delete them at a > time that I know to be safe and before I need to share a new file that > would involve recycling the slot, and I'm relying on DSM segment scope > cleanup only to handle error paths. That means that in generally I > only need space for 2 * P shared BufFiles at a time. But there is a > problem case: when the leader needs to exit early, it needs to be able > to transfer ownership of any files it has created, which could be more > than we planned for, and then not participate any further in the hash > join, so it can't participate in the on-demand sharing scheme. 
I thought the idea was that the structure we're talking about here owns all the files, up to 2 from a leader that wandered off plus up to 2 for each worker. Last process standing removes them. Or are you saying each worker only needs 2 files but the leader needs a potentially unbounded number? > Perhaps we can find a way to describe a variable number of BufFiles > (ie batches) in a fixed space by making sure the filenames are > constructed in a way that lets us just have to say how many there are. That could be done. > Then the next problem is that for each BufFile we have to know how > many 1GB segments there are to unlink (files named foo, foo.1, foo.2, > ...), which Peter's code currently captures by publishing the file > size in the descriptor... but if a fixed size object must describe N > BufFiles, where can I put the size of each one? Maybe I could put it > in a header of the file itself (yuck!), or maybe I could decide that I > don't care what the size is, I'll simply unlink "foo", then "foo.1", > then "foo.2", ... until I get ENOENT. There's nothing wrong with that algorithm as far as I'm concerned. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
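The "unlink until ENOENT" loop blessed here is simple enough to sketch. The path construction below is invented for illustration; real temp file segment naming lives in fd.c:

/*
 * Sketch of the cleanup loop described above: unlink numbered 1GB
 * segments until the first ENOENT, so no size needs to be published.
 */
static void
unlink_buffile_segments(const char *base_path)
{
    char        path[MAXPGPATH];
    int         segno = 0;

    for (;;)
    {
        if (segno == 0)
            snprintf(path, sizeof(path), "%s", base_path);
        else
            snprintf(path, sizeof(path), "%s.%d", base_path, segno);

        if (unlink(path) < 0)
        {
            /* ENOENT means no more segments; anything else, log and limp on. */
            if (errno != ENOENT)
                elog(LOG, "could not unlink file \"%s\": %m", path);
            break;
        }
        segno++;
    }
}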
On Fri, Feb 10, 2017 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> I agree that the pinned segment case doesn't matter right now, I just
>> wanted to point it out. I like your $10 wrench analogy, but maybe it
>> could be argued that adding a dsm_on_destroy() callback mechanism is
>> not only better than adding another refcount to track that other
>> refcount, but also a steal at only $8.
>
> If it's that simple, it might be worth doing, but I bet it's not. One
> problem is that there's a race condition: there will inevitably be a
> period of time after you've called dsm_attach() and before you've
> attached to the specific data structure that we're talking about here.
> So suppose the last guy who actually knows about this data structure
> dies horribly and doesn't clean up because the DSM isn't being
> destroyed; moments later, you die horribly before reaching the code
> where you attach to this data structure. Oops.

Right, I mentioned this problem earlier ("and also register the on_dsm_detach callback, before any chance that an error might be thrown (is that difficult?); failure to do so could result in file leaks").

Here's my thought process... please tell me where I'm going wrong:

I have been assuming that it's not enough to just deal with this when the leader detaches, on the theory that other participants will always detach first: that probably isn't true in some error cases, and could contribute to spurious racy errors where other workers complain about disappearing files if the leader somehow shuts down and cleans up while a worker is still running. Therefore we need *some* kind of refcounting, whether it's a new kind or a new mechanism based on the existing kind.

I have also been assuming that we don't want to teach dsm.c directly about this stuff; it shouldn't need to know about other modules, so we don't want it talking to buffile.c directly and managing a special table of files; instead we want a system of callbacks. Therefore client code needs to do something after attaching to the segment in each backend.

It doesn't matter whether we use an on_dsm_detach() callback and manage our own refcount to infer that destruction is imminent, or a new on_dsm_destroy() callback which tells us so explicitly: both ways we'll need to make sure that anyone who attaches to the segment also "attaches" to this shared BufFile manager object inside it, because any backend might turn out to be the one that is last to detach.

That brings us to the race you mentioned. Isn't it sufficient to say that you aren't allowed to do anything that might throw in between attaching to the segment and attaching to the SharedBufFileManager that it contains?

Up until two minutes ago I assumed that policy would leave only two possibilities: you attach to the DSM segment and attach to the SharedBufFileManager successfully, or you attach to the DSM segment and then die horribly (but not throw) and the postmaster restarts the whole cluster and blows all temp files away with RemovePgTempFiles(). But I see now in the comment of that function that crash-induced restarts don't call that because "someone might want to examine the temp files for debugging purposes". Given that policy for regular private BufFiles, I don't see why that shouldn't apply equally to shared files: after a crash restart, you may have some junk files that won't be cleaned up until your next clean restart, whether they were private or shared BufFiles.
> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource. But ASLR breaks that, especially for loadable modules. You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se. But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors. Maybe you can think of a way to
> plug that hole too but you're waaaaay over your $8 budget by this
> point.

Agreed, those approaches seem like non-starters.

>> My problem here is that I don't know how many batches I'll finish up
>> creating. [...]
>
> I thought the idea was that the structure we're talking about here
> owns all the files, up to 2 from a leader that wandered off plus up to
> 2 for each worker. Last process standing removes them. Or are you
> saying each worker only needs 2 files but the leader needs a
> potentially unbounded number?

Yes, potentially unbounded in rare cases. If we plan for N batches, and then run out of work_mem because our estimates were just wrong or the distribution of keys is sufficiently skewed, we'll run HashIncreaseNumBatches, and that could happen more than once. I have a suite of contrived test queries that hits all the various modes and code paths of hash join, and it includes a query that plans for one batch but finishes up creating many, and then the leader exits. I'll post that to the other thread along with my latest patch series soon.

>> Perhaps we can find a way to describe a variable number of BufFiles
>> (ie batches) in a fixed space by making sure the filenames are
>> constructed in a way that lets us just have to say how many there are.
>
> That could be done.

Cool.

>> Then the next problem is that for each BufFile we have to know how
>> many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
>> ...), which Peter's code currently captures by publishing the file
>> size in the descriptor... but if a fixed size object must describe N
>> BufFiles, where can I put the size of each one? Maybe I could put it
>> in a header of the file itself (yuck!), or maybe I could decide that I
>> don't care what the size is, I'll simply unlink "foo", then "foo.1",
>> then "foo.2", ... until I get ENOENT.
>
> There's nothing wrong with that algorithm as far as I'm concerned.

Cool.

--
Thomas Munro
http://www.enterprisedb.com
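The ordering constraint under discussion -- nothing that can throw between the two attach steps -- can be seen in a worker-side sketch like this. The magic number, TOC key and shared_buf_file_manager_attach() are invented names; dsm_attach(), shm_toc_attach() and shm_toc_lookup() are existing infrastructure:

/* Sketch of the worker-side attach ordering implied above. */
static SharedBufFileManager *
worker_attach_to_manager(dsm_handle handle)
{
    dsm_segment *seg = dsm_attach(handle);
    shm_toc    *toc = shm_toc_attach(MY_MAGIC, dsm_segment_address(seg));
    SharedBufFileManager *mgr;

    /* Nothing that can throw may run between here... */
    mgr = shm_toc_lookup(toc, KEY_SHARED_BUFFILE_MANAGER);
    shared_buf_file_manager_attach(mgr, seg);   /* refcount++, on_dsm_detach */
    /* ...and here; afterwards, throwing is safe again. */

    return mgr;
}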
On Thu, Feb 9, 2017 at 2:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource. But ASLR breaks that, especially for loadable modules. You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se. But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors. Maybe you can think of a way to
> plug that hole too but you're waaaaay over your $8 budget by this
> point.

At the risk of stating the obvious, ISTM that the right way to do this, at a high level, is to err on the side of unneeded extra unlink() calls, not leaking files. And, to make the window for problems ("remaining hole that you haven't quite managed to plug") practically indistinguishable from no hole at all, in a way that's kind of baked into the API. It's not like we currently throw an error when there is a problem with deleting temp files that are no longer needed on resource manager cleanup. We simply log the fact that it happened, and limp on.

I attach my V8. This does not yet do anything with on_dsm_detach(). I've run out of time to work on it this week, and am starting a new job next week at VMware, which I'll need time to settle into. So I'm posting this now, since you can still very much see the direction I'm going in, and can give me any feedback that you have. If anyone wants to show me how it's done by building on this, and finishing what I have off, be my guest. The new stuff probably isn't quite as polished as I would prefer, but time grows short, so I won't withhold it.

Changes:

* Implements the refcount thing, albeit in a way that leaves a small window for double unlink() calls if there is an error during the small window in which there is worker/leader co-ownership of a BufFile (just add an "elog(ERROR)" just before leader-as-worker Tuplesort state is ended within _bt_leafbuild() to see what I mean). This implies that background workers can be reclaimed once the leader needs to start its final on-the-fly merge, which is nice. As an example of how that's nice, this change makes maintenance_work_mem a budget that we more strictly adhere to.

* Fixes bitrot caused by recent logtape.c bugfix in master branch.

* No local segment is created during unification unless and until one is required. (In practice, for current use of BufFile infrastructure, no "local" segment is ever created, even if we force a randomAccess case using one of the testing GUCs from 0002-* -- we'd have to use another GUC to *also* force there to be no reclamation.)

* Better testing. As I just mentioned, we can now force logtape.c to not reclaim blocks, so that new local segments are created as part of a unified BufFile; these have different considerations from a resource management point of view. Despite being part of the same "unified" BufFile from the leader's perspective, such a segment behaves like a local segment, so it definitely seems like a good idea to have test coverage for this, at least during development. (I have a pretty rough test suite that I'm using; development of this patch has been somewhat test driven.)

* Better encapsulation of BufFile stuff.
I am even closer to the ideal of this whole sharing mechanism being a fairly generic BufFile thing that logtape.c piggy-backs on without having special knowledge of the mechanism. It's still true that the mechanism (sharing/unification) is written principally with logtape.c in mind, but that's just because of its performance characteristics. Nothing to do with the interface. * Worked through items raised by Thomas in his 2017-01-30 mail to this thread. >>>> Secondly, I might not want to be constrained by a >>>> fixed-sized DSM segment to hold my SharedBufFile objects... there are >>>> cases where I need to shared a number of batch files that is unknown >>>> at the start of execution time when the DSM segment is sized (I'll >>>> write about that shortly on the Parallel Shared Hash thread). Maybe I >>>> can find a way to get rid of that requirement. Or maybe it could >>>> support DSA memory too, but I don't think it's possible to use >>>> on_dsm_detach-based cleanup routines that refer to DSA memory because >>>> by the time any given DSM segment's detach hook runs, there's no >>>> telling which other DSM segments have been detached already, so the >>>> DSA area may already have partially vanished; some other kind of hook >>>> that runs earlier would be needed... >>> >>> Again, wrench. I like the wrench analogy too, FWIW. >> My problem here is that I don't know how many batches I'll finish up >> creating. In general that's OK because I can hold onto them as >> private BufFiles owned by participants with the existing cleanup >> mechanism, and then share them just before they need to be shared (ie >> when we switch to processing the next batch so they need to be >> readable by all). Now I only ever share one inner and one outer batch >> file per participant at a time, and then I explicitly delete them at a >> time that I know to be safe and before I need to share a new file that >> would involve recycling the slot, and I'm relying on DSM segment scope >> cleanup only to handle error paths. That means that in generally I >> only need space for 2 * P shared BufFiles at a time. But there is a >> problem case: when the leader needs to exit early, it needs to be able >> to transfer ownership of any files it has created, which could be more >> than we planned for, and then not participate any further in the hash >> join, so it can't participate in the on-demand sharing scheme. I think that parallel CREATE INDEX can easily live with the restriction that we need to know how many shared BufFiles are needed up front. It will either be 1, or 2 (when there are 2 nbtsort.c spools, for unique index builds). We can also detect when the limit is already exceeded early, and back out, just as we do when there are no parallel workers currently available. >> Then the next problem is that for each BufFile we have to know how >> many 1GB segments there are to unlink (files named foo, foo.1, foo.2, >> ...), which Peter's code currently captures by publishing the file >> size in the descriptor... but if a fixed size object must describe N >> BufFiles, where can I put the size of each one? Maybe I could put it >> in a header of the file itself (yuck!), or maybe I could decide that I >> don't care what the size is, I'll simply unlink "foo", then "foo.1", >> then "foo.2", ... until I get ENOENT. > > There's nothing wrong with that algorithm as far as I'm concerned. 
I would like to point out, just to be completely clear, that while this V8 doesn't "do refcounts properly" (it doesn't use an on_dsm_detach() hook and so on), the only benefit that that would actually have for parallel CREATE INDEX is that it makes it impossible that the user could see a spurious ENOENT related log message during unlink() (I err on the side of doing too much unlinking, not too little). Which is very unlikely anyway. So, if that's okay for parallel hash join, as indicated by Robert here, an issue like that would presumably also be okay for parallel CREATE INDEX. It then follows that what I'm missing here is something that is only really needed for the parallel hash join patch anyway.

I really want to help Thomas, and am not shirking what I feel is a responsibility to assist him. I have every intention of breaking this down to produce a usable patch that only has the BufFile + resource management stuff, that follows the interface he sketched as a requirement for me in his most recent revision of his patch series ("0009-hj-shared-buffile-strawman-v4.patch"). I'm just pointing out that my patch is reasonably complete as a standalone piece of work right now, AFAICT.

--
Peter Geoghegan
On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Here's my thought process... please tell me where I'm going wrong: > > I have been assuming that it's not enough to just deal with this when > the leader detaches on the theory that other participants will always > detach first: that probably isn't true in some error cases, and could > contribute to spurious racy errors where other workers complain about > disappearing files if the leader somehow shuts down and cleans up > while a worker is still running. Therefore we need *some* kind of > refcounting, whether it's a new kind or a new mechanism based on the > existing kind. +1. > I have also been assuming that we don't want to teach dsm.c directly > about this stuff; it shouldn't need to know about other modules, so we > don't want it talking to buffile.c directly and managing a special > table of files; instead we want a system of callbacks. Therefore > client code needs to do something after attaching to the segment in > each backend. +1. > It doesn't matter whether we use an on_dsm_detach() callback and > manage our own refcount to infer that destruction is imminent, or a > new on_dsm_destroy() callback which tells us so explicitly: both ways > we'll need to make sure that anyone who attaches to the segment also > "attaches" to this shared BufFile manager object inside it, because > any backend might turn out to be the one that is last to detach. Not entirely. In the first case, you don't need the requirement that everyone who attaches the segment must attach to the shared BufFile manager. In the second case, you do. > That bring us to the race you mentioned. Isn't it sufficient to say > that you aren't allowed to do anything that might throw in between > attaching to the segment and attaching to the SharedBufFileManager > that it contains? That would be sufficient, but I think it's not a very good design. It means, for example, that nothing between the time you attach to the segment and the time you attach to this manager can palloc() anything. So, for example, it would have to happen before ParallelWorkerMain reaches the call to shm_mq_attach, which kinda sucks because we want to do that as soon as possible after attaching to the DSM segment so that errors are reported properly thereafter. Note that's the very first thing we do now, except for working out what the arguments to that call need to be. Also, while it's currently safe to assume that shm_toc_attach() and shm_toc_lookup() don't throw errors, I've thought about the possibility of installing some sort of cache in shm_toc_lookup() to amortize the cost of lookups, if the number of keys ever got too large. And that would then require a palloc(). Generally, backend code should be free to throw errors. When it's absolutely necessary for a short segment of code to avoid that, then we do, but you can't really rely on any substantial amount of code to be that way, or stay that way. And in this case, even if we didn't mind those problems or had some solution to them, I think that the shared buffer manager shouldn't have to be something that is whacked directly into parallel.c all the way at the beginning of the initialization sequence so that nothing can fail before it happens. I think it should be an optional data structure that clients of the parallel infrastructure can decide to use, or to not use. 
It should be at arm's length from the core code, just like the way ParallelQueryMain() is distinct from ParallelWorkerMain() and sets up its own set of data structures with their own set of keys. All that stuff is happy to happen after whatever ParallelWorkerMain() feels that it needs to do, even if ParallelWorkerMain might throw errors for any number of unknown reasons. Similarly, I think this new thing should be something that an executor node can decide to create inside its own per-node space -- reserved via ExecParallelEstimate, initialized via ExecParallelInitializeDSM, etc. There's no need for it to be deeply coupled to parallel.c itself unless we force that choice by sticking a no-fail requirement in there.

> Up until two minutes ago I assumed that policy would leave only two
> possibilities: you attach to the DSM segment and attach to the
> SharedBufFileManager successfully or you attach to the DSM segment and
> then die horribly (but not throw) and the postmaster restarts the
> whole cluster and blows all temp files away with RemovePgTempFiles().
> But I see now in the comment of that function that crash-induced
> restarts don't call that because "someone might want to examine the
> temp files for debugging purposes". Given that policy for regular
> private BufFiles, I don't see why that shouldn't apply equally to
> shared files: after a crash restart, you may have some junk files that
> won't be cleaned up until your next clean restart, whether they were
> private or shared BufFiles.

I think most people (other than Tom) would agree that that policy isn't really sensible any more; it probably made sense when the PostgreSQL user community was much smaller and consisted mostly of the people developing PostgreSQL, but these days it's much more likely to cause operational headaches than to help a developer debug.

Regardless, I think the primary danger isn't failure to remove a file (although that is best avoided) but removing one too soon (causing someone else to error when opening it, or on Windows causing the delete itself to error out). It's not really OK for random stuff to throw errors in corner cases because we were too lazy to ensure that cleanup operations happen in the right order.

>> I thought the idea was that the structure we're talking about here
>> owns all the files, up to 2 from a leader that wandered off plus up to
>> 2 for each worker. Last process standing removes them. Or are you
>> saying each worker only needs 2 files but the leader needs a
>> potentially unbounded number?
>
> Yes, potentially unbounded in rare case. If we plan for N batches,
> and then run out of work_mem because our estimates were just wrong or
> the distributions of keys is sufficiently skewed, we'll run
> HashIncreaseNumBatches, and that could happen more than once. I have
> a suite of contrived test queries that hits all the various modes and
> code paths of hash join, and it includes a query that plans for one
> batch but finishes up creating many, and then the leader exits. I'll
> post that to the other thread along with my latest patch series soon.

Hmm, OK. So that's going to probably require something where a fixed amount of DSM can describe an arbitrary number of temp file series. But that also means this is an even-more-special-purpose tool that shouldn't be deeply tied into parallel.c so that it can run before any errors happen.
Basically, I think the "let's write the code between here and here so it throws no errors" technique is, for 99% of PostgreSQL programming, difficult and fragile. We shouldn't rely on it if there is some other reasonable option. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
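The per-node arrangement Robert describes follows the existing estimate/initialize pattern. Here is a hedged sketch: MyNodeState, its fields, and the sizing helper are invented, while the shm_toc calls and the estimator/toc members of ParallelContext are existing infrastructure:

/* Sketch of per-node DSM reservation, as described above. */
void
ExecMyNodeEstimate(MyNodeState *node, ParallelContext *pcxt)
{
    /* e.g., two files per worker plus two for a leader that wanders off */
    node->shared_size = shared_buf_file_array_size(2 * pcxt->nworkers + 2);
    shm_toc_estimate_chunk(&pcxt->estimator, node->shared_size);
    shm_toc_estimate_keys(&pcxt->estimator, 1);
}

void
ExecMyNodeInitializeDSM(MyNodeState *node, ParallelContext *pcxt)
{
    SharedBufFileArray *array;

    array = (SharedBufFileArray *) shm_toc_allocate(pcxt->toc, node->shared_size);
    /* set refcount to 1, register the leader's on_dsm_detach callback, ... */
    shm_toc_insert(pcxt->toc, node->plan_node_id, array);
}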
On Sat, Feb 4, 2017 at 2:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It might just have been that the table was too small to be an
> effective target for parallel sequential scan with so many workers,
> and so a presorted best case CREATE INDEX, which isn't that different,
> also fails to see much benefit (compared to what you'd see with a
> similar case involving a larger table). In other words, I might have
> jumped the gun in emphasizing issues with hardware and I/O bandwidth
> over issues around data volume (that I/O parallelism is inherently not
> very helpful with these relatively small tables).
>
> As I've pointed out a couple of times before, bigger sorts will be
> more CPU bound because sorting itself has costs that grow
> linearithmically, whereas writing out runs has costs that grow
> linearly. The relative cost of the I/O can be expected to go down as
> input goes up for this reason. At the same time, a larger input might
> make better use of I/O parallelism, which reduces the cost paid in
> latency to write out runs in absolute terms.

Here are some results with your latest patch, using the same test as before but this time with SCALE=100 (= 100,000,000 rows). The table sizes are:

                          List of relations
 Schema |         Name         | Type  |    Owner     | Size  | Description
--------+----------------------+-------+--------------+-------+-------------
 public | million_words        | table | thomas.munro | 42 MB |
 public | some_words           | table | thomas.munro | 19 MB |
 public | test_intwide_u_asc   | table | thomas.munro | 18 GB |
 public | test_intwide_u_desc  | table | thomas.munro | 18 GB |
 public | test_intwide_u_rand  | table | thomas.munro | 18 GB |
 public | test_textwide_u_asc  | table | thomas.munro | 19 GB |
 public | test_textwide_u_desc | table | thomas.munro | 19 GB |
 public | test_textwide_u_rand | table | thomas.munro | 19 GB |

To reduce the number of combinations I did only unique data and built only non-unique indexes with only 'wide' tuples (= key plus a text column that holds a 151-character wide string, rather than just the key), and also didn't bother with the 1MB memory size as suggested. Here are the results up to 4 workers (a results table going up to 8 workers is attached, since it wouldn't format nicely if I pasted it here). Again, the w = 0 time is seconds, the rest show relative speed-up. This data was all in the OS page cache because of a dummy run done first, and I verified with 'sar' that there was exactly 0 reading from the block device. The CPU was pegged on leader + workers during sort runs, and then the leader's CPU hovered around 93-98% during the merge/btree build. I had some technical problems getting a cold-cache read-from-actual-disk-each-time test run to work properly, but can go back and do that again if anyone thinks that would be interesting data to see.
   tab    | ord  | mem |  w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+------+-----+---------+-------+-------+-------+-------
 intwide  | asc  |  64 |   67.91 | 1.26x | 1.46x | 1.62x | 1.73x
 intwide  | asc  | 256 |   67.84 | 1.23x | 1.48x | 1.63x | 1.79x
 intwide  | asc  | 512 |   69.01 | 1.25x | 1.50x | 1.63x | 1.80x
 intwide  | desc |  64 |   98.08 | 1.48x | 1.83x | 2.03x | 2.25x
 intwide  | desc | 256 |   99.87 | 1.43x | 1.80x | 2.03x | 2.29x
 intwide  | desc | 512 |  104.09 | 1.44x | 1.85x | 2.09x | 2.33x
 intwide  | rand |  64 |  138.03 | 1.56x | 2.04x | 2.42x | 2.58x
 intwide  | rand | 256 |  139.44 | 1.61x | 2.04x | 2.38x | 2.56x
 intwide  | rand | 512 |  138.96 | 1.52x | 2.03x | 2.28x | 2.57x
 textwide | asc  |  64 |  207.10 | 1.20x | 1.07x | 1.09x | 1.11x
 textwide | asc  | 256 |  200.62 | 1.19x | 1.06x | 1.04x | 0.99x
 textwide | asc  | 512 |  191.42 | 1.16x | 0.97x | 1.01x | 0.94x
 textwide | desc |  64 | 1382.48 | 1.89x | 2.37x | 3.18x | 3.87x
 textwide | desc | 256 | 1427.99 | 1.89x | 2.42x | 3.24x | 4.00x
 textwide | desc | 512 | 1453.21 | 1.86x | 2.39x | 3.23x | 3.75x
 textwide | rand |  64 | 1587.28 | 1.89x | 2.37x | 2.66x | 2.75x
 textwide | rand | 256 | 1557.90 | 1.85x | 2.34x | 2.64x | 2.73x
 textwide | rand | 512 | 1547.97 | 1.87x | 2.32x | 2.64x | 2.71x

"textwide" "asc" is nearly an order of magnitude faster than other initial orders without parallelism, but then parallelism doesn't seem to help it much. Also, using more than 64MB doesn't ever seem to help very much; in the "desc" case it hinders.

I was curious to understand how performance changes if we become just a bit less correlated (rather than completely uncorrelated or perfectly inversely correlated), so I tried out a 'banana skin' case: I took the contents of the textwide asc table and copied it to a new table, and then moved the 900 words matching 'banana%' to the physical end of the heap by deleting and reinserting them in one transaction. I guess if we were to use this technology for CLUSTER, this might be representative of a situation where you regularly recluster a growing table. The results were pretty much like "asc":

   tab    |  ord   | mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+--------+-----+--------+-------+-------+-------+-------
 textwide | banana |  64 | 213.39 | 1.17x | 1.11x | 1.15x | 1.09x

It's hard to speculate about this, but I guess that a significant number of indexes in real world databases might be uncorrelated to insert order. A newly imported or insert-only table might have one highly correlated index for a surrogate primary key or time column, but other indexes might tend to be uncorrelated. But really, who knows... in a kind of textbook perfectly correlated case such as a time series table with an append-only time or sequence based key, you might want to use BRIN rather than B-Tree anyway.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan <pg@bowt.ie> wrote: > At the risk of stating the obvious, ISTM that the right way to do > this, at a high level, is to err on the side of unneeded extra > unlink() calls, not leaking files. And, to make the window for problem > ("remaining hole that you haven't quite managed to plug") practically > indistinguishable from no hole at all, in a way that's kind of baked > into the API. I do not think there should be any reason why we can't get the resource accounting exactly correct here. If a single backend manages to remove every temporary file that it creates exactly once (and that's currently true, modulo system crashes), a group of cooperating backends ought to be able to manage to remove every temporary file that any of them create exactly once (again, modulo system crashes). I do agree that a duplicate unlink() call isn't as bad as a missing unlink() call, at least if there's no possibility that the filename could have been reused by some other process, or some other part of our own process, which doesn't want that new file unlinked. But it's messy. If the seatbelts in your car were to randomly unbuckle, that would be a safety hazard. If they were to randomly refuse to unbuckle, you wouldn't say "that's OK because it's not a safety hazard", you'd say "these seatbelts are badly designed". And I think the same is true of this mechanism. The way to make this 100% reliable is to set things up so that there is joint ownership from the beginning and shared state that lets you know whether the work has already been done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 16, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> At the risk of stating the obvious, ISTM that the right way to do
>> this, at a high level, is to err on the side of unneeded extra
>> unlink() calls, not leaking files. And, to make the window for problem
>> ("remaining hole that you haven't quite managed to plug") practically
>> indistinguishable from no hole at all, in a way that's kind of baked
>> into the API.
>
> I do not think there should be any reason why we can't get the
> resource accounting exactly correct here. If a single backend manages
> to remove every temporary file that it creates exactly once (and
> that's currently true, modulo system crashes), a group of cooperating
> backends ought to be able to manage to remove every temporary file
> that any of them create exactly once (again, modulo system crashes).

I believe that we are fully in agreement here. In particular, I think it's bad that there is an API that says "caller shouldn't throw an elog error between these two points", and that will be fixed before too long. I just think that it's worth acknowledging a certain nuance.

> I do agree that a duplicate unlink() call isn't as bad as a missing
> unlink() call, at least if there's no possibility that the filename
> could have been reused by some other process, or some other part of
> our own process, which doesn't want that new file unlinked. But it's
> messy. If the seatbelts in your car were to randomly unbuckle, that
> would be a safety hazard. If they were to randomly refuse to
> unbuckle, you wouldn't say "that's OK because it's not a safety
> hazard", you'd say "these seatbelts are badly designed". And I think
> the same is true of this mechanism.

If it happened in the lifetime of only one out of a million seatbelts manufactured, and they were manufactured at a competitive price (not over-engineered), I probably wouldn't say that. The fact that the existing resource manager code only LOGs most temp file related failures suggests to me that that's a "can't happen" condition, but we still hedge. I would still like to hedge against even (theoretically) impossible risks.

Maybe I'm just being pedantic here, since we both actually want the code to do the same thing.

--
Peter Geoghegan
On Wed, Feb 15, 2017 at 6:05 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here are some results with your latest patch, using the same test as
> before but this time with SCALE=100 (= 100,000,000 rows).

Cool.

> To reduce the number of combinations I did only unique data and built
> only non-unique indexes with only 'wide' tuples (= key plus a text
> column that holds a 151-character wide string, rather than just the
> key), and also didn't bother with the 1MB memory size as suggested.
> Here are the results up to 4 workers (a results table going up to 8
> workers is attached, since it wouldn't format nicely if I pasted it
> here).

I think that you are still I/O bound in a way that is addressable by adding more disks. The exception is the text cases, where the patch does best. (I don't place too much emphasis on that because I know that in the long term, we'll have abbreviated keys, which will take some of the sheen off of that.)

> Again, the w = 0 time is seconds, the rest show relative
> speed-up.

I think it's worth pointing out that while there are cases where we see no benefit from going from 4 to 8 workers, it tends to hardly hurt at all, or hardly help at all. It's almost irrelevant that the number of workers used is excessive, at least up until the point when all cores have their own worker. That's a nice quality for this to have -- the only danger is that we use parallelism when we shouldn't have at all, because the serial case could manage an internal sort, and the sort was small enough that that could be a notable factor.

> "textwide" "asc" is nearly an order of magnitude faster than other
> initial orders without parallelism, but then parallelism doesn't seem
> to help it much. Also, using more than 64MB doesn't ever seem to help
> very much; in the "desc" case it hinders.

Maybe it's CPU cache efficiency? There are edge cases where multiple passes are faster than one pass. That's the only explanation I can think of.

> I was curious to understand how performance changes if we become just
> a bit less correlated (rather than completely uncorrelated or
> perfectly inversely correlated), so I tried out a 'banana skin' case:
> I took the contents of the textwide asc table and copied it to a new
> table, and then moved the 900 words matching 'banana%' to the physical
> end of the heap by deleting and reinserting them in one transaction.

A likely problem with that is that most runs will actually not have their own banana skin, so to speak. You only see a big drop in performance when every quicksort operation has presorted input, but with one or more out-of-order tuples at the end.

In order to see a really unfortunate case with parallel CREATE INDEX, you'd probably have to have enough memory that workers don't need to do their own merge (and so a worker's work almost entirely consists of one big quicksort operation), with enough "banana skin heap pages" that the parallel heap scan is pretty much guaranteed to end up giving "banana skin" (out of order) tuples to every worker, making all of them "have a slip" (throw away a huge amount of work as the presorted optimization is defeated right at the end of its sequential read through). A better approach would be to have several small localized areas across the input where tuples are a little out of order. That would probably show that the performance is pretty much in line with random cases.
> It's hard to speculate about this, but I guess that a significant
> number of indexes in real world databases might be uncorrelated to
> insert order.

That would certainly be true with text, where we see a risk of (small)
regressions.

--
Peter Geoghegan
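To make the "banana skin" cost concrete, here is a small standalone C
analogue (an editorial sketch, not code from the patch): a
presorted-input check, like the one a quicksort implementation may run
before sorting, only detects a single trailing out-of-order element
after scanning nearly the entire input, and all of that scanning is
then thrown away before the real sort begins.

#include <stdbool.h>
#include <stdio.h>

/*
 * Analogue of a presorted-input check: scan until the first inversion.
 * With a "banana skin" (one out-of-order element near the end), the
 * check fails only after inspecting almost the whole array.
 */
static bool
check_presorted(const int *a, int n)
{
    for (int i = 1; i < n; i++)
        if (a[i] < a[i - 1])
            return false;
    return true;
}

int
main(void)
{
    int         a[1000];

    for (int i = 0; i < 1000; i++)
        a[i] = i;
    a[999] = -1;                /* the banana skin */

    /* Fails, but only after ~999 comparisons that are then wasted. */
    printf("presorted: %d\n", (int) check_presorted(a, 1000));
    return 0;
}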
On Thu, Feb 16, 2017 at 11:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Maybe I'm just being pedantic here, since we both actually want the
> code to do the same thing.

Pedantry from either of us? Nah...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Feb 11, 2017 at 1:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Yes, potentially unbounded in rare cases. If we plan for N batches,
>> and then run out of work_mem because our estimates were just wrong or
>> the distribution of keys is sufficiently skewed, we'll run
>> HashIncreaseNumBatches, and that could happen more than once. I have
>> a suite of contrived test queries that hits all the various modes and
>> code paths of hash join, and it includes a query that plans for one
>> batch but finishes up creating many, and then the leader exits. I'll
>> post that to the other thread along with my latest patch series soon.
>
> Hmm, OK. So that's going to probably require something where a fixed
> amount of DSM can describe an arbitrary number of temp file series.
> But that also means this is an even-more-special-purpose tool that
> shouldn't be deeply tied into parallel.c so that it can run before any
> errors happen.
>
> Basically, I think the "let's write the code between here and here so
> it throws no errors" technique is, for 99% of PostgreSQL programming,
> difficult and fragile. We shouldn't rely on it if there is some other
> reasonable option.

I'm testing a patch that lets you set up a fixed-size SharedBufFileSet
object in a DSM segment, with its own refcount for the reason you
explained. It supports a dynamically expandable set of numbered files,
so each participant gets to export file 0, file 1, file 2 and so on as
required, in any order. I think this should suit both Parallel
Tuplesort, which needs to export just one file from each participant,
and Parallel Shared Hash, which doesn't know up front how many batches
it will produce. Not quite ready, but I will post a version tomorrow
to get Peter's reaction.

--
Thomas Munro
http://www.enterprisedb.com
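A rough sketch of the interface shape Thomas describes might look like
the following. The struct layout and function names here are
illustrative guesses, not the actual patch: the essential properties
are a fixed-size control object in DSM, a refcount for
last-to-detach-cleans-up semantics, and per-participant file numbers
that can grow without consuming any additional shared memory.

#include "postgres.h"
#include "storage/buffile.h"
#include "storage/dsm.h"
#include "storage/spin.h"

/*
 * Illustrative sketch only: a fixed-size control object that lives in
 * a DSM segment.  Files themselves are discovered on disk by
 * constructed name, so the set of numbered files per participant can
 * expand dynamically with no new shmem.
 */
typedef struct SharedBufFileSet
{
    slock_t     mutex;          /* protects refcount */
    int         refcount;       /* attached participants */
    pid_t       creator_pid;    /* used to build unique on-disk names */
    uint32      set_id;         /* distinguishes concurrent sets */
} SharedBufFileSet;

/* Each participant may export file 0, 1, 2, ... in any order. */
extern void SharedBufFileSetInit(SharedBufFileSet *set, dsm_segment *seg);
extern void SharedBufFileSetAttach(SharedBufFileSet *set, dsm_segment *seg);
extern BufFile *SharedBufFileCreate(SharedBufFileSet *set,
                                    int participant, int file_no);
extern BufFile *SharedBufFileImport(SharedBufFileSet *set,
                                    int participant, int file_no);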
On Wed, Mar 1, 2017 at 10:29 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I'm testing a patch that lets you set up a fixed-size SharedBufFileSet
> object in a DSM segment, with its own refcount for the reason you
> explained. It supports a dynamically expandable set of numbered files,
> so each participant gets to export file 0, file 1, file 2 and so on as
> required, in any order. I think this should suit both Parallel
> Tuplesort, which needs to export just one file from each participant,
> and Parallel Shared Hash, which doesn't know up front how many batches
> it will produce. Not quite ready, but I will post a version tomorrow
> to get Peter's reaction.

See 0007-hj-shared-buf-file-v6.patch in the v6 tarball in the parallel
shared hash thread.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Feb 16, 2017 at 8:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I do not think there should be any reason why we can't get the
>> resource accounting exactly correct here. If a single backend manages
>> to remove every temporary file that it creates exactly once (and
>> that's currently true, modulo system crashes), a group of cooperating
>> backends ought to be able to manage to remove every temporary file
>> that any of them create exactly once (again, modulo system crashes).
>
> I believe that we are fully in agreement here. In particular, I think
> it's bad that there is an API that says "caller shouldn't throw an
> elog error between these two points", and that will be fixed before
> too long. I just think that it's worth acknowledging a certain nuance.

I attach my V9 of the patch. I came up with some stuff for the design
of resource management that I think meets every design goal that we
have for shared/unified BufFiles:

* Avoids both resource leaks, and spurious double-freeing of resources
(e.g., a second unlink() for a file from a different process) when
there are errors. The latter problem was possible before, a known
issue with V8 of the patch. I believe that this revision avoids these
problems in a way that is *absolutely bulletproof* in the face of
arbitrary failures (e.g., palloc() failure) in any process at any
time. Although, be warned that there is a remaining open item
concerning resource management in the leader-as-worker case, which I
go into below.

There are now what you might call "critical sections" in one function.
That is, there are points where we cannot throw an error (without a
BEGIN_CRIT_SECTION()!), but those are entirely confined to unification
code within the leader, where we can be completely sure that no error
can be raised. The leader can even fail before some but not all of a
particular worker's segments are in its local resource manager, and we
still do the right thing.

I've been testing this by adding code that randomly throws errors at
points interspersed throughout worker and leader unification hand-off
points. I then leave this stress-test build to run for a few hours,
while monitoring for leaked files and spurious fd.c reports of
double-unlink() and similar issues. Test builds change LOG to PANIC
within several places in fd.c, while MAX_PHYSICAL_FILESIZE was reduced
from 1GiB to BLCKSZ.

All of these guarantees are made without any special care from caller
to buffile.c. The only V9 change to tuplesort.c or logtape.c in this
general area is that they have to pass a dynamic shared memory segment
to buffile.c, so that it can register a new callback. That's it. This
may be of particular interest to Thomas. All complexity is confined to
buffile.c.

* No expansion in the use of shared memory to manage resources.
BufFile refcount is still per-worker. The role of local resource
managers is unchanged.

* Additional complexity over and above ordinary BufFile resource
management is confined to the leader process and its on_dsm_detach()
callback. Only the leader registers a callback. Of course, refcount
management within BufFileClose() can still take place in workers, but
that isn't something that we rely on (that's only for non-error
paths). In general, worker processes mostly have resource managers
managing their temp file segments as a thing that has nothing to do
with BufFiles (BufFiles are still not owned by resowner.c/fd.c --
they're blissfully unaware of all of this stuff).
* In general, unified BufFiles can still be treated in exactly the
same way as conventional BufFiles, and things just work, without any
special cases being exercised internally.

There is still an open item here, though: The leader-as-worker
Tuplesortstate, a special case, can still leak files. So,
stress-testing will only show the patch to be completely robust
against resource leaks when nbtsort.c is modified to enable
FORCE_SINGLE_WORKER testing. Despite the name FORCE_SINGLE_WORKER, you
can also modify that file to force there to be arbitrarily many
workers requested (just change "requested = 1" to something else). The
leader-as-worker problem is avoided because we don't have the leader
participating as a worker this way, which would otherwise present
issues for resowner.c that I haven't got around to fixing just yet. It
isn't hard to imagine why this is -- one backend with two FDs for
certain fd.c temp segments is just going to cause problems for
resowner.c without additional special care. Didn't seem worth blocking
on that. I want to prove that my general approach is workable. That
problem is confined to one backend's resource manager when it is the
leader participating as a worker. It is not a refcount problem.

The simplest solution here would be to ban the leader-as-worker case
by contract. Alternatively, we could pass fd.c segments from the
leader-as-worker Tuplesortstate's BufFile to the leader
Tuplesortstate's BufFile without opening or closing anything. This
way, there will be no second vFD entry for any segment at any time.

I've also made several changes to the cost model, changes agreed to
over on the "Cost model for parallel CREATE INDEX" thread. No need for
a recap on what those changes are here. In short, things have been
*significantly* simplified in that area.

Finally, note that I decided to throw out more code within
tuplesort.c. Now, a parallel leader is a thing that is explicitly set
up to be exactly consistent with a conventional/serial external sort
whose merge is about to begin. In particular, it now uses mergeruns().

Robert said that he thinks that this is a patch that is to some degree
a parallelism patch, and to some degree about sorting. I'd say that by
now, it's roughly 5% about sorting, in terms of the proportion of code
that expressly considers sorting. Most of the new stuff in tuplesort.c
is about managing dependencies between participating backends. I've
really focused on avoiding new special cases, especially with V9.

--
Peter Geoghegan
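To make the callback mechanism concrete, here is a hedged sketch of
how a leader-only on_dsm_detach() callback can be wired up; the
UnifiedBufFileState type and the two unified_* names are hypothetical
stand-ins, not the patch itself. on_dsm_detach() is the existing
dsm.c API, and it fires during both normal completion and error/abort
unwinding, which is what makes exactly-once leader cleanup possible.

#include "postgres.h"
#include "storage/buffile.h"
#include "storage/dsm.h"

/* Hypothetical per-sort state describing the unified BufFile. */
typedef struct UnifiedBufFileState UnifiedBufFileState;

extern void unified_buffile_cleanup(UnifiedBufFileState *state);

/*
 * Runs when the leader detaches from the DSM segment -- whether via
 * normal completion or error unwinding -- so the close of the unified
 * BufFile happens exactly once, in exactly one process.
 */
static void
leader_cleanup_callback(dsm_segment *seg, Datum arg)
{
    UnifiedBufFileState *state = (UnifiedBufFileState *) DatumGetPointer(arg);

    unified_buffile_cleanup(state);
}

/* Only the leader registers; workers keep relying on resowner.c. */
static void
register_leader_cleanup(dsm_segment *seg, UnifiedBufFileState *state)
{
    on_dsm_detach(seg, leader_cleanup_callback, PointerGetDatum(state));
}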
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> There is still an open item here, though: The leader-as-worker
> Tuplesortstate, a special case, can still leak files.

I phrased this badly. What I mean is that there can be instances where
temp files are left on disk following a failure such as palloc() OOM;
no backend ends up doing an unlink() iff a leader-as-worker
Tuplesortstate was used and we get unlucky. I did not mean a leak of
virtual or real file descriptors, which would see Postgres print a
refcount leak warning from resowner.c.

Naturally, these "leaked" files will eventually be deleted by the next
restart of the server at the latest, within RemovePgTempFiles(). Note
also that a duplicate unlink() (with annoying LOG message) is
impossible under any circumstances with V9, regardless of whether or
not a leader-as-worker Tuplesortstate is involved.

Anyway, I was sure that I needed to completely nail this down in order
to be consistent with existing guarantees, but another look at
OpenTemporaryFile() makes me doubt that. ResourceOwnerEnlargeFiles()
is called, which itself uses palloc(), which can of course fail. There
are remarks over that function within resowner.c about OOM:

/*
 * Make sure there is room for at least one more entry in a ResourceOwner's
 * files reference array.
 *
 * This is separate from actually inserting an entry because if we run out
 * of memory, it's critical to do so *before* acquiring the resource.
 */
void
ResourceOwnerEnlargeFiles(ResourceOwner owner)
{
    ...
}

But this happens after OpenTemporaryFileInTablespace() has already
returned. Taking care to allocate memory up-front here is motivated by
keeping the vFD cache entry and current resource owner in perfect
agreement about the FD_XACT_TEMPORARY-ness of a file, and that's it.
It's *not* true that there is a broader sense in which
OpenTemporaryFile() is atomic, which for some reason I previously
believed to be the case. So, I haven't failed to prevent an outcome
that wasn't already possible.

It doesn't seem like it would be that hard to fix this, and then have
the parallel tuplesort patch live up to that new higher standard. But,
it's possible that Tom or maybe someone else would consider that a bad
idea, for roughly the same reason that we don't call
RemovePgTempFiles() for *crash* induced restarts, as mentioned by
Thomas up-thread:

 * NOTE: we could, but don't, call this during a post-backend-crash restart
 * cycle.  The argument for not doing it is that someone might want to
 * examine the temp files for debugging purposes.  This does however mean
 * that OpenTemporaryFile had better allow for collision with an existing
 * temp file name.
 */
void
RemovePgTempFiles(void)
{
    ...
}

Note that I did put some thought into making sure OpenTemporaryFile()
does the right thing with collisions with existing temp files. So,
maybe the right thing is to do nothing at all. I don't have strong
feelings either way on this question.

--
Peter Geoghegan
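The resowner.c idiom Peter is pointing at -- reserve tracking space
*before* acquiring the resource, so that an OOM can never strand an
acquired resource outside the owner's tracking -- can be paraphrased
in a few lines of fd.c-style code. This is a sketch of the sequence,
not an exact excerpt:

    File        file;

    /* may elog(ERROR) on palloc() failure -- nothing acquired yet */
    ResourceOwnerEnlargeFiles(CurrentResourceOwner);

    /* acquire the resource */
    file = OpenTemporaryFileInTablespace(tblspcOid, true);

    /* guaranteed not to fail, so the file cannot escape tracking */
    ResourceOwnerRememberFile(CurrentResourceOwner, file);

The point of the ordering is that the only failure-prone step happens
while there is still nothing to leak; once the temp file exists, the
bookkeeping step is infallible.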
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I attach my V9 of the patch. I came up with some stuff for the design
> of resource management that I think meets every design goal that we
> have for shared/unified BufFiles:

Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
now pass -1 as the index block argument to compute_parallel_worker(),
just as all callers that aren't parallel index scan do after that
commit. This issue caused V9 to never choose parallel CREATE INDEX
within nbtsort.c. There was also a small amount of bitrot.

Attached V10 fixes this regression. I also couldn't resist adding a
few new assertions that I thought were worth having to buffile.c, plus
dedicated wait events for parallel tuplesort. And, I fixed a silly bug
added in V9 around where worker_wait() should occur.

--
Peter Geoghegan
On Sun, Mar 19, 2017 at 9:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I attach my V9 of the patch. I came up with some stuff for the design
>> of resource management that I think meets every design goal that we
>> have for shared/unified BufFiles:
>
> Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
> now pass -1 as the index block argument to compute_parallel_worker(),
> just as all callers that aren't parallel index scan do after that
> commit. This issue caused V9 to never choose parallel CREATE INDEX
> within nbtsort.c. There was also a small amount of bitrot.
>
> Attached V10 fixes this regression. I also couldn't resist adding a
> few new assertions that I thought were worth having to buffile.c, plus
> dedicated wait events for parallel tuplesort. And, I fixed a silly bug
> added in V9 around where worker_wait() should occur.

Some initial review comments:

- * This code is moderately slow (~10% slower) compared to the regular
- * btree (insertion) build code on sorted or well-clustered data. On
- * random data, however, the insertion build code is unusable -- the
- * difference on a 60MB heap is a factor of 15 because the random
- * probes into the btree thrash the buffer pool. (NOTE: the above
- * "10%" estimate is probably obsolete, since it refers to an old and
- * not very good external sort implementation that used to exist in
- * this module. tuplesort.c is almost certainly faster.)

While I agree that the old comment is probably inaccurate, I don't
think dropping it without comment in a patch to implement parallel
sorting is the way to go. How about updating it to be more current as
a separate patch?

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BTREE_SHARED UINT64CONST(0xA000000000000001)
+#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
+#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)

1, 2, and 3 would probably work just as well. The parallel
infrastructure uses high-numbered values to avoid conflict with
plan_node_id values, but this is a utility statement so there's no
such problem. But it doesn't matter very much.

+ * Note: caller had better already hold some type of lock on the table and
+ * index.
+ */
+int
+plan_create_index_workers(Oid tableOid, Oid indexOid)

Caller should pass down the Relation rather than the Oid. That is
better both because it avoids unnecessary work and because it more or
less automatically avoids the problem mentioned in the note. Why is
this being put in planner.c rather than something specific to creating
indexes? Not sure that's a good idea.

+ * This should be called when workers have flushed out temp file buffers and
+ * yielded control to caller's process. Workers should hold open their
+ * BufFiles at least until the caller's process is able to call here and
+ * assume ownership of BufFile. The general pattern is that workers make
+ * available data from their temp files to one nominated process; there is
+ * no support for workers that want to read back data from their original
+ * BufFiles following writes performed by the caller, or any other
+ * synchronization beyond what is implied by caller contract. All
+ * communication occurs in one direction. All output is made available to
+ * caller's process exactly once by workers, following call made here at the
+ * tail end of processing.
Thomas has designed a system for sharing files among cooperating
processes that lacks several of these restrictions. With his system,
it's still necessary for all data to be written and flushed by the
writer before anybody tries to read it. But the restriction that the
worker has to hold its BufFile open until the leader can assume
ownership goes away. That's a good thing; it avoids the need for
workers to sit around waiting for the leader to assume ownership of a
resource instead of going away faster and freeing up worker slots for
some other query, or moving on to some other computation. The
restriction that the worker can't reread the data after handing off
the file also goes away. The files can be read and written by any
participant in any order, as many times as you like, with only the
restriction that the caller must guarantee that data will be written
and flushed from private buffers before it can be read.

I don't see any reason to commit both his system and your system, and
his is more general, so I think you should use it. That would cut
hundreds of lines from this patch with no real disadvantage that I can
see -- including things like worker_wait(), which are only needed
because of the shortcomings of the underlying mechanism.

+ * run. Parallel workers always use quicksort, however.

Comment fails to mention a reason.

+ elog(LOG, "%d using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
+ state->worker, state->availMem / 1024, numInputTapes);

I think "worker %d" or "participant %d" would be a lot better than
just starting the message with "%d". (There are multiple instances of
this, with various messages.)

I think some of the smaller changes that this patch makes, like
extending the parallel context machinery to support SnapshotAny, could
be usefully broken out as separately-committable patches.

I haven't really dug down into the details here, but with the
exception of the buffile.c stuff, which I don't like, the overall
design of this seems pretty sensible to me. We might eventually want
to do something more clever at the sorting level, but those changes
would be confined to tuplesort.c, and all the other changes you've
introduced here would stand on their own. Which is to say that even if
there's more win to be had here, this is a good start.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
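The write-flush-then-read-anywhere contract Robert describes can be
sketched as follows; none of the Shared* names are real -- they stand
in for whatever the shared BufFile mechanism ends up exporting, and
the two functions are a hypothetical usage pattern, not code from
either patch series.

#include "storage/buffile.h"

typedef struct SharedBufFileSet SharedBufFileSet;   /* hypothetical */

extern BufFile *SharedBufFileCreate(SharedBufFileSet *set,
                                    int participant, int file_no);
extern void SharedBufFileExport(BufFile *file);
extern BufFile *SharedBufFileImport(SharedBufFileSet *set,
                                    int participant, int file_no);

/* In participant A: write, flush, publish -- then A may simply exit. */
static void
producer(SharedBufFileSet *set, int a, char *data, size_t len)
{
    BufFile    *out = SharedBufFileCreate(set, a, 0);

    BufFileWrite(out, data, len);
    SharedBufFileExport(out);   /* flush private buffers; publish */
}

/* In any participant, any order, any number of times. */
static void
consumer(SharedBufFileSet *set, int a, char *buf, size_t len)
{
    BufFile    *in = SharedBufFileImport(set, a, 0);

    BufFileRead(in, buf, len);
    BufFileClose(in);           /* no unlink: last detacher cleans up */
}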
On Tue, Mar 21, 2017 at 9:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> - * This code is moderately slow (~10% slower) compared to the regular
> - * btree (insertion) build code on sorted or well-clustered data. On
> - * random data, however, the insertion build code is unusable -- the
> - * difference on a 60MB heap is a factor of 15 because the random
> - * probes into the btree thrash the buffer pool. (NOTE: the above
> - * "10%" estimate is probably obsolete, since it refers to an old and
> - * not very good external sort implementation that used to exist in
> - * this module. tuplesort.c is almost certainly faster.)
>
> While I agree that the old comment is probably inaccurate, I don't
> think dropping it without comment in a patch to implement parallel
> sorting is the way to go. How about updating it to be more current as
> a separate patch?

I think that since the comment refers to code from before 1999, it can
go. Any separate patch to remove it would have an entirely negative
linediff.

> +/* Magic numbers for parallel state sharing */
> 1, 2, and 3 would probably work just as well.

Okay.

> Why is this being put in planner.c rather than something specific to
> creating indexes? Not sure that's a good idea.

The idea is that it's the planner's domain, but this is a utility
statement, so it makes sense to put it next to the CLUSTER function
that determines whether CLUSTER sorts rather than does an index scan.
I don't have strong feelings on how appropriate that is.

> + * This should be called when workers have flushed out temp file buffers and
> + * yielded control to caller's process. Workers should hold open their
> + * BufFiles at least until the caller's process is able to call here and
> + * assume ownership of BufFile. The general pattern is that workers make
> + * available data from their temp files to one nominated process; there is
> + * no support for workers that want to read back data from their original
> + * BufFiles following writes performed by the caller, or any other
> + * synchronization beyond what is implied by caller contract. All
> + * communication occurs in one direction. All output is made available to
> + * caller's process exactly once by workers, following call made here at the
> + * tail end of processing.
>
> Thomas has designed a system for sharing files among cooperating
> processes that lacks several of these restrictions. With his system,
> it's still necessary for all data to be written and flushed by the
> writer before anybody tries to read it. But the restriction that the
> worker has to hold its BufFile open until the leader can assume
> ownership goes away. That's a good thing; it avoids the need for
> workers to sit around waiting for the leader to assume ownership of a
> resource instead of going away faster and freeing up worker slots for
> some other query, or moving on to some other computation. The
> restriction that the worker can't reread the data after handing off
> the file also goes away.

There is no restriction about workers not being able to reread data.
That comment makes it clear that that's only when the leader writes to
the file. It alludes to rereading within a worker following the leader
writing to their files in order to recycle blocks within logtape.c,
which the patch never has to do, unless you enable one of the 0002-*
testing GUCs to force randomAccess.
Obviously, if you write to the file in the leader, there is little
that the worker can do afterwards, but it's not a given that you'd
want to do that, and this patch actually never does. You could equally
well say that PHJ fails to provide for my requirement for having the
leader write to the files sensibly in order to recycle blocks, a
requirement that its shared BufFile mechanism expressly does not
support.

> That would cut hundreds of
> lines from this patch with no real disadvantage that I can see --
> including things like worker_wait(), which are only needed because of
> the shortcomings of the underlying mechanism.

I think it would definitely be a significant net gain in LOC. And,
worker_wait() will probably be replaced by the use of the barrier
abstraction anyway. It didn't seem worth creating a dependency on it
early, given my simple requirements. PHJ uses barriers instead,
presumably because there is much more of this stuff. The workers
generally won't have to wait at all. It's expected to be pretty much
instantaneous.

> + * run. Parallel workers always use quicksort, however.
>
> Comment fails to mention a reason.

Well, I don't think that there is any reason to use replacement
selection at all, what with the additional merge heap work last year.
But, the theory there remains that RS is good when you can get one big
run and no merge. You're not going to get that with parallel sort in
any case, since the leader must merge. Besides, merging in the workers
happens in the workers. And, the backspace requirement of 32MB of
workMem per participant pretty much eliminates any use of RS that
you'd get otherwise.

> I think "worker %d" or "participant %d" would be a lot better than
> just starting the message with "%d". (There are multiple instances of
> this, with various messages.)

Okay.

> I think some of the smaller changes that this patch makes, like
> extending the parallel context machinery to support SnapshotAny, could
> be usefully broken out as separately-committable patches.

Okay.

> I haven't really dug down into the details here, but with the
> exception of the buffile.c stuff, which I don't like, the overall
> design of this seems pretty sensible to me. We might eventually want
> to do something more clever at the sorting level, but those changes
> would be confined to tuplesort.c, and all the other changes you've
> introduced here would stand on their own. Which is to say that even
> if there's more win to be had here, this is a good start.

That's certainly how I feel about it.

I believe that the main reason that you like the design I came up with
on the whole is that it's minimally divergent from the serial case.
The changes in logtape.c and tuplesort.c are actually very minor. But,
the reason that that's possible at all is because buffile.c adds some
complexity that is all about maintaining existing assumptions. You
don't like that complexity. I would suggest that it's useful that I've
been able to isolate it to buffile.c fairly well.

A quick tally of the existing assumptions this patch preserves:

1. Resource managers still work as before. This means that error
handling will work the same way as before. We cooperate with that
mechanism, rather than supplanting it entirely.

2. There is only one BufFile per logical tapeset per tuplesort, in
both workers and the leader.

3. You can write to the end of a unified BufFile in leader to have it
extended, while resource managers continue to do the right thing
despite differing requirements for each segment.
This leaves things sane for workers to read, provided the leader keeps
to its own space in the unified BufFile.

4. Temp files must go away at EoX, no matter what.

Thomas has created a kind of shadow resource manager in shared memory.
So, he isn't using fd.c resource management stuff. He is concerned
with a set of BufFiles, each of which has specific significance to
each parallel hash join (they're per worker HJ batch). PHJ has an
unpredictable number of BufFiles, while parallel tuplesort always has
one, just as before. For the most part, I think that what Thomas has
done reflects his own requirements, just as what I've done reflects my
requirements. There seems to be no excellent opportunity to use a
common infrastructure.

I think that not cooperating with the existing mechanism will prove to
be buggy. Following a quick look at the latest PHJ patch series, and
its 0008-hj-shared-buf-file-v8.patch file, I already see one example.
I notice that there could be multiple calls to
pgstat_report_tempfile() within each backend for the same BufFile
segment. Isn't that counting the same thing more than once? In
general, it seems problematic that there are now "true" fd.c temp
segments, as well as shared BufFile temp segments that are never in a
backend resource manager.

--
Peter Geoghegan
On Tue, Mar 21, 2017 at 2:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that since the comment refers to code from before 1999, it can
> go. Any separate patch to remove it would have an entirely negative
> linediff.

It's a good general principle that a patch should do one thing well
and not make unrelated changes. I try hard to adhere to that principle
in my commits, and I think other committers generally do (and should),
too. Of course, different people draw the line in different places. If
you can convince another committer to include that change in their
commit of this patch, well, that's not my cup of tea, but so be it. If
you want me to consider committing this, you're going to have to
submit that part separately, preferably on a separate thread with a
suitably descriptive subject line.

> Obviously, if you write to the file in the leader, there is little
> that the worker can do afterwards, but it's not a given that you'd
> want to do that, and this patch actually never does. You could equally
> well say that PHJ fails to provide for my requirement for having the
> leader write to the files sensibly in order to recycle blocks, a
> requirement that its shared BufFile mechanism expressly does not
> support.

From my point of view, the main point is that having two completely
separate mechanisms for managing temporary files that need to be
shared across cooperating workers is not a good decision. That's a
need that's going to come up over and over again, and it's not
reasonable for everybody who needs it to add a separate mechanism for
doing it. We need to have ONE mechanism for it.

The second point is that I'm pretty convinced that the design you've
chosen is fundamentally wrong. I've attempted to explain that multiple
times, starting about three months ago with
http://postgr.es/m/CA+TgmoYP0vzPw64DfMQT1JHY6SzyAvjogLkj3erMZzzN2f9xLA@mail.gmail.com
and continuing across many subsequent emails on multiple threads. It's
just not OK in my book for a worker to create something that it
initially owns and then later transfer it to the leader. The
cooperating backends should have joint ownership of the objects from
the beginning, and the last process to exit the set should clean up
those resources.

>> That would cut hundreds of
>> lines from this patch with no real disadvantage that I can see --
>> including things like worker_wait(), which are only needed because of
>> the shortcomings of the underlying mechanism.
>
> I think it would definitely be a significant net gain in LOC. And,
> worker_wait() will probably be replaced by the use of the barrier
> abstraction anyway.

No, because if you do it Thomas's way, the worker can exit right away,
without waiting. You don't have to wait via a different method; you
escape waiting altogether. I understand that your point is that the
wait will always be brief, but I think that's probably an optimistic
assumption and definitely an unnecessary assumption. It's optimistic
because there is absolutely no guarantee that all workers will take
the same amount of time to sort the data they read. It is absolutely
not the case that all data sets sort at the same speed. Because of the
way parallel sequential scan works, we're somewhat insulated from
that; workers that sort faster will get a larger chunk of the table.
However, that only means that workers will finish generating their
sorted runs at about the same time, not that they will finish merging
at the same time.
And, indeed, if some workers end up with more data than others (so
that they finish building runs at about the same time) then some will
probably take longer to complete the merging than others.

But even if it were true that the waits will always be brief, I still
think the way you've done it is a bad idea, because now tuplesort.c
has to know that it needs to wait because of some detail of
lower-level resource management about which it should not have to
care. That alone is a sufficient reason to want a better approach. I
completely accept that whatever abstraction we use at the BufFile
level has to be something that can be plumbed into logtape.c, and if
Thomas's mechanism can't be bolted in there in a sensible way then
that's a problem. But I feel quite strongly that the solution to that
problem isn't to adopt the approach you've taken here.

>> + * run. Parallel workers always use quicksort, however.
>>
>> Comment fails to mention a reason.
>
> Well, I don't think that there is any reason to use replacement
> selection at all, what with the additional merge heap work last year.
> But, the theory there remains that RS is good when you can get one big
> run and no merge. You're not going to get that with parallel sort in
> any case, since the leader must merge. Besides, merging in the workers
> happens in the workers. And, the backspace requirement of 32MB of
> workMem per participant pretty much eliminates any use of RS that
> you'd get otherwise.

So, please mention that briefly in the comment.

> I believe that the main reason that you like the design I came up with
> on the whole is that it's minimally divergent from the serial case.

That's part of it, I guess, but it's more that the code you've added
to do parallelism here looks an awful lot like what's gotten added to
do parallelism in other cases, like parallel query. That's probably a
good sign.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> From my point of view, the main point is that having two completely
> separate mechanisms for managing temporary files that need to be
> shared across cooperating workers is not a good decision. That's a
> need that's going to come up over and over again, and it's not
> reasonable for everybody who needs it to add a separate mechanism for
> doing it. We need to have ONE mechanism for it.

Obviously I understand that there is value in code reuse in general.
The exact extent to which code reuse is possible here has been unclear
throughout, because it's complicated for all kinds of reasons. That's
why Thomas and I had 2 multi-hour Skype calls all about it.

> It's just not OK in my book for a worker to create something that it
> initially owns and then later transfer it to the leader.

Isn't that an essential part of having a refcount, in general? You
were the one that suggested refcounting.

> The cooperating backends should have joint ownership of the objects from
> the beginning, and the last process to exit the set should clean up
> those resources.

That seems like a facile summary of the situation. There is a sense in
which there is always joint ownership of files with my design. But
there is also a sense in which there isn't, because it's impossible to
do that while not completely reinventing resource management of temp
files. I wanted to preserve resowner.c ownership of fd.c segments.

You maintain that it's better to have the leader unlink() everything
at the end, and suppress the errors when that doesn't work, so that
that path always just plows through. I disagree with that. It is a
trade-off, I suppose. I have now run out of time to work through it
with you or Thomas, though.

> But even if it were true that the waits will always be brief, I still
> think the way you've done it is a bad idea, because now tuplesort.c
> has to know that it needs to wait because of some detail of
> lower-level resource management about which it should not have to
> care. That alone is a sufficient reason to want a better approach.

There is already a point at which the leader needs to wait, so that it
can accumulate stats that nbtsort.c cares about. So we already need a
leader wait point within nbtsort.c (that one is called directly by
nbtsort.c). Doesn't seem like too bad of a wart to have the same thing
for workers.

>> I believe that the main reason that you like the design I came up with
>> on the whole is that it's minimally divergent from the serial case.
>
> That's part of it, I guess, but it's more that the code you've added
> to do parallelism here looks an awful lot like what's gotten added to
> do parallelism in other cases, like parallel query. That's probably a
> good sign.

It's also a good sign that it makes CREATE INDEX approximately 3 times
faster.

--
Peter Geoghegan
On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> From my point of view, the main point is that having two completely
>> separate mechanisms for managing temporary files that need to be
>> shared across cooperating workers is not a good decision. That's a
>> need that's going to come up over and over again, and it's not
>> reasonable for everybody who needs it to add a separate mechanism for
>> doing it. We need to have ONE mechanism for it.
>
> Obviously I understand that there is value in code reuse in general.
> The exact extent to which code reuse is possible here has been unclear
> throughout, because it's complicated for all kinds of reasons. That's
> why Thomas and I had 2 multi-hour Skype calls all about it.

I agree that the extent to which code reuse is possible here is
somewhat unclear, but I am 100% confident that the answer is non-zero.
You and Thomas both need BufFiles that can be shared across multiple
backends associated with the same ParallelContext. I don't understand
how you can argue that it's reasonable to have two different ways of
sharing the same kind of object across the same set of processes. And
if that's not reasonable, then somehow we need to come up with a
single mechanism that can meet both your requirements and Thomas's
requirements.

>> It's just not OK in my book for a worker to create something that it
>> initially owns and then later transfer it to the leader.
>
> Isn't that an essential part of having a refcount, in general? You
> were the one that suggested refcounting.

No, quite the opposite. My point in suggesting adding a refcount was
to avoid needing to have a single owner. Instead, the process that
decrements the reference count to zero becomes responsible for doing
the cleanup. What you've done with the ref count is use it as some
kind of medium for transferring responsibility from backend A to
backend B; what I want is to allow backends A, B, C, D, E, and F to
attach to the same shared resource, and whichever one of them happens
to be the last one out of the room shuts off the lights.

>> The cooperating backends should have joint ownership of the objects from
>> the beginning, and the last process to exit the set should clean up
>> those resources.
>
> That seems like a facile summary of the situation. There is a sense in
> which there is always joint ownership of files with my design. But
> there is also a sense in which there isn't, because it's impossible to
> do that while not completely reinventing resource management of temp
> files. I wanted to preserve resowner.c ownership of fd.c segments.

As I've said before, I think that's an anti-goal. This is a different
problem, and trying to reuse the solution we chose for the
non-parallel case doesn't really work. resowner.c could end up owning
a shared reference count which it's responsible for decrementing --
and then decrementing it removes the file if the result is zero. But
it can't own performing the actual unlink(), because then we can't
support cases where the file may have multiple readers, since whoever
owns the unlink() might try to zap the file out from under one of the
others.

> You maintain that it's better to have the leader unlink() everything
> at the end, and suppress the errors when that doesn't work, so that
> that path always just plows through.

I don't want the leader to be responsible for anything.
I want the last process to detach to be responsible for cleanup,
regardless of which process that ends up being. I want that for lots
of good reasons which I have articulated, including (1) it's how all
other resource management for parallel query already works, e.g. DSM,
DSA, and group locking; (2) it avoids the need for one process to sit
and wait until another process assumes ownership, which isn't a
feature even if (as you contend, and I'm not convinced) it doesn't
hurt much; and (3) it allows for use cases where multiple processes
are reading from the same shared BufFile without the risk that some
other process will try to unlink() the file while it's still in use.
The point for me isn't so much whether unlink() ever ignores errors as
whether cleanup (however defined) is an operation guaranteed to happen
exactly once.

> I disagree with that. It is a
> trade-off, I suppose. I have now run out of time to work through it
> with you or Thomas, though.

Bummer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
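A minimal sketch of the "last one out turns off the lights" discipline
Robert describes, using a refcount in shared memory; all names here
are hypothetical stand-ins, and the cleanup callback is assumed to be
registered via on_dsm_detach() by every participant, not just the
leader.

#include "postgres.h"
#include "storage/dsm.h"
#include "storage/spin.h"

typedef struct SharedFileSetState
{
    slock_t     mutex;
    int         refcount;       /* one per attached participant */
} SharedFileSetState;

extern void delete_all_files(SharedFileSetState *state);

/*
 * Whichever process decrements the count to zero -- worker or leader,
 * in any order of exit, on success or during error unwinding --
 * performs the cleanup, and exactly one process does.
 */
static void
shared_files_on_detach(dsm_segment *seg, Datum arg)
{
    SharedFileSetState *state = (SharedFileSetState *) DatumGetPointer(arg);
    bool        last;

    SpinLockAcquire(&state->mutex);
    last = (--state->refcount == 0);
    SpinLockRelease(&state->mutex);

    if (last)
        delete_all_files(state);    /* idempotent; may ignore ENOENT */
}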
On Wed, Mar 22, 2017 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I disagree with that. It is a
>> trade-off, I suppose. I have now run out of time to work through it
>> with you or Thomas, though.
>
> Bummer.

I'm going to experiment with refactoring the v10 parallel CREATE INDEX
patch to use the SharedBufFileSet interface from
hj-shared-buf-file-v8.patch today and see what problems I run into.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Mar 21, 2017 at 2:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that the extent to which code reuse is possible here is
> somewhat unclear, but I am 100% confident that the answer is non-zero.
> You and Thomas both need BufFiles that can be shared across multiple
> backends associated with the same ParallelContext. I don't understand
> how you can argue that it's reasonable to have two different ways of
> sharing the same kind of object across the same set of processes.

I didn't argue that. Rather, I argued that there are going to be
significant additional requirements for PHJ, because it has to support
arbitrarily many BufFiles, rather than either 1 or 2 (one per
tuplesort/logtapeset). Just how "significant" that would be I cannot
say, regrettably. (Or, we're going to have to make logtape.c multiplex
BufFiles, which risks breaking other logtape.c routines that aren't
even used just yet.)

>> Isn't that an essential part of having a refcount, in general? You
>> were the one that suggested refcounting.
>
> No, quite the opposite. My point in suggesting adding a refcount was
> to avoid needing to have a single owner. Instead, the process that
> decrements the reference count to zero becomes responsible for doing
> the cleanup. What you've done with the ref count is use it as some
> kind of medium for transferring responsibility from backend A to
> backend B; what I want is to allow backends A, B, C, D, E, and F to
> attach to the same shared resource, and whichever one of them happens
> to be the last one out of the room shuts off the lights.

Actually, that's quite possible with the design I came up with. The
restriction that Thomas can't live with as I've left things is that
you have to know the number of BufFiles ahead of time. I'm pretty sure
that that's all it is. (I do sympathize with the fact that that isn't
very helpful to him, though.)

> As I've said before, I think that's an anti-goal. This is a different
> problem, and trying to reuse the solution we chose for the
> non-parallel case doesn't really work. resowner.c could end up owning
> a shared reference count which it's responsible for decrementing --
> and then decrementing it removes the file if the result is zero. But
> it can't own performing the actual unlink(), because then we can't
> support cases where the file may have multiple readers, since whoever
> owns the unlink() might try to zap the file out from under one of the
> others.

Define "zap the file". I think, based on your remarks here, that
you've misunderstood my design. I think you should at least understand
it fully if you're going to dismiss it.

It is true that a worker resowner can unlink() the files
mid-unification, in the same manner as with conventional temp files,
and not decrement its refcount in shared memory, or care at all in any
special way. This is okay because the leader (in the case of parallel
tuplesort) will realize that it should not "turn out the lights",
finding that remaining reference when it calls BufFileClose() in its
registered callback, as it alone must. It doesn't matter that the
unlink() may have already occurred, or may be just about to occur,
because we are only operating on already-opened files, and never on
the link itself (we don't have to stat() the file link, for example,
which is naturally only a task for the unlink()'ing backend anyway).
You might say that the worker only blows away the link itself, not the
file proper, since it may still be open in the leader (say).

** We rely on the fact that files are themselves a kind of reference
counted thing, in general; they have an independent existence from the
link originally used to open() them. **

The reason that there is a brief wait in workers for parallel
tuplesort is because it gives us the opportunity to have the
immediately subsequent worker BufFileClose() not turn out the lights
in the worker, because the leader must have a reference on the BufFile
when workers are released. So, there is a kind of interlock that makes
sure that there is always at least 1 owner.

There would be no need for an additional wait but for the fact that
the leader wants to unify multiple worker BufFiles as one, and must
open them all at once for the sake of simplicity. But that's just how
parallel tuplesort in particular happens to work, since it has only
one BufFile in the leader, which it wants to operate on with
everything set up up-front.

Thomas' design cannot reliably know how many segments there are in
workers in error paths, which necessitates his unlink()-ENOENT-ignore
hack. My solution is that workers/owners look after their own temp
segments in the conventional way, until they reach BufFileClose(),
which may never come if there is an error. The only way that clean-up
won't happen in conventional resowner.c-in-worker fashion is if
BufFileClose() is reached in the owner/worker. BufFileClose() must be
reached when there is no error, which has to happen anyway when using
temp files. (Else there is a temp file leak warning from resowner.c.)

This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
since only the worker itself can reliably know how many segments it
has opened at every single instant in time. Because it's the owner!

>> You maintain that it's better to have the leader unlink() everything
>> at the end, and suppress the errors when that doesn't work, so that
>> that path always just plows through.
>
> I don't want the leader to be responsible for anything.

I meant in the case of parallel CREATE INDEX specifically, were it to
use this other mechanism. Substitute "leader" with "the last backend"
in reading my remarks here.

> I want the
> last process to detach to be responsible for cleanup, regardless of
> which process that ends up being. I want that for lots of good
> reasons which I have articulated, including (1) it's how all other
> resource management for parallel query already works, e.g. DSM, DSA,
> and group locking; (2) it avoids the need for one process to sit and
> wait until another process assumes ownership, which isn't a feature
> even if (as you contend, and I'm not convinced) it doesn't hurt much;
> and (3) it allows for use cases where multiple processes are reading
> from the same shared BufFile without the risk that some other process
> will try to unlink() the file while it's still in use. The point for
> me isn't so much whether unlink() ever ignores errors as whether
> cleanup (however defined) is an operation guaranteed to happen exactly
> once.

My patch demonstrably has these properties. I've done quite a bit of
fault injection testing to prove it. (Granted, I need to take extra
steps for the leader-as-worker backend, a special case, which I
haven't done already because I was waiting on your feedback on the
appropriate trade-off there.)

--
Peter Geoghegan
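The interlock, as described, can be sketched in a few lines; every
name here is a hypothetical stand-in, and the real patch differs in
detail. The worker publishes its tape, waits until the leader has
taken its own reference, and only then closes -- so the refcount never
reaches zero while the file is still needed.

/* Worker, at the end of its sort (hypothetical names): */
BufFileExportShared(worker_buffile);  /* flush; make visible to leader */
worker_wait(shared_state);            /* until leader holds a reference */
BufFileClose(worker_buffile);         /* refcount > 1: lights stay on */

/* Leader, meanwhile: */
for (int i = 0; i < nworkers; i++)
    unified_open_worker_tape(leader_buffile, i);  /* takes references */
release_workers(shared_state);        /* workers may now close and exit */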
On Tue, Mar 21, 2017 at 2:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I'm going to experiment with refactoring the v10 parallel CREATE INDEX
> patch to use the SharedBufFileSet interface from
> hj-shared-buf-file-v8.patch today and see what problems I run into.

I would be happy if you took over parallel CREATE INDEX completely. It
makes a certain amount of sense, and not just because I am no longer
able to work on it. You're the one doing things with shared BufFiles
that are of significant complexity. Certainly more complicated than
what parallel CREATE INDEX needs in every way, and necessarily so. I
will still have some more feedback on your shared BufFile design,
though, while it's fresh in my mind.

--
Peter Geoghegan
On Tue, Mar 21, 2017 at 7:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>>> Isn't that an essential part of having a refcount, in general? You
>>> were the one that suggested refcounting.
>>
>> No, quite the opposite. My point in suggesting adding a refcount was
>> to avoid needing to have a single owner. Instead, the process that
>> decrements the reference count to zero becomes responsible for doing
>> the cleanup. What you've done with the ref count is use it as some
>> kind of medium for transferring responsibility from backend A to
>> backend B; what I want is to allow backends A, B, C, D, E, and F to
>> attach to the same shared resource, and whichever one of them happens
>> to be the last one out of the room shuts off the lights.
>
> Actually, that's quite possible with the design I came up with.

I don't think it is. What sequence of calls to the APIs you've
proposed would accomplish that goal? I don't see anything in this
patch set that would permit anything other than a handoff from the
worker to the leader. There seems to be no way for the ref count to be
more than 1 (or 2?).

> The
> restriction that Thomas can't live with as I've left things is that
> you have to know the number of BufFiles ahead of time. I'm pretty sure
> that that's all it is. (I do sympathize with the fact that that isn't
> very helpful to him, though.)

I feel like there's some cognitive dissonance here. On the one hand,
you're saying we should use your design. On the other hand, you are
admitting that in at least one key respect, it won't meet Thomas's
requirements. On the third hand, you just said that you weren't
arguing for two mechanisms for sharing a BufFile across cooperating
parallel processes. I don't see how you can hold all three of those
positions simultaneously.

>> As I've said before, I think that's an anti-goal. This is a different
>> problem, and trying to reuse the solution we chose for the
>> non-parallel case doesn't really work. resowner.c could end up owning
>> a shared reference count which it's responsible for decrementing --
>> and then decrementing it removes the file if the result is zero. But
>> it can't own performing the actual unlink(), because then we can't
>> support cases where the file may have multiple readers, since whoever
>> owns the unlink() might try to zap the file out from under one of the
>> others.
>
> Define "zap the file". I think, based on your remarks here, that
> you've misunderstood my design. I think you should at least understand
> it fully if you're going to dismiss it.

zap was a colloquialism for unlink(). I concede that I don't fully
understand your design, and am trying to understand those things I do
not yet understand.

> It is true that a worker resowner can unlink() the files
> mid-unification, in the same manner as with conventional temp files,
> and not decrement its refcount in shared memory, or care at all in any
> special way. This is okay because the leader (in the case of parallel
> tuplesort) will realize that it should not "turn out the lights",
> finding that remaining reference when it calls BufFileClose() in its
> registered callback, as it alone must. It doesn't matter that the
> unlink() may have already occurred, or may be just about to occur,
> because we are only operating on already-opened files, and never on
> the link itself (we don't have to stat() the file link, for example,
> which is naturally only a task for the unlink()'ing backend anyway).
> You might say that the worker only blows away the link itself, not the
> file proper, since it may still be open in the leader (say).

Well, that sounds like it's counting on fd.c not to close the file
descriptor at an inconvenient point in time and reopen it later, which
is not guaranteed.

> Thomas' design cannot reliably know how many segments there are in
> workers in error paths, which necessitates his unlink()-ENOENT-ignore
> hack. My solution is that workers/owners look after their own temp
> segments in the conventional way, until they reach BufFileClose(),
> which may never come if there is an error. The only way that clean-up
> won't happen in conventional resowner.c-in-worker fashion is if
> BufFileClose() is reached in the owner/worker. BufFileClose() must be
> reached when there is no error, which has to happen anyway when using
> temp files. (Else there is a temp file leak warning from resowner.c.)
>
> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
> since only the worker itself can reliably know how many segments it
> has opened at every single instant in time. Because it's the owner!

Above, you said that your design would allow for a group of processes
to share access to a file, with the last one that abandons it "turning
out the lights". But here, you are referring to it as having one owner
-- "only the worker itself" can know the number of segments. Those
things are exact opposites of each other.

I don't think there's any problem with ignoring ENOENT, and I don't
think there's any need for a process to know the exact number of
segments in some temporary file. In a shared-ownership environment,
that information can't be stored in a backend-private cache; it's got
to be available to whichever backend ends up being the last one out.
There are only two ways to do that. One is to store it in shared
memory, and the other is to discover it from the filesystem. The
former is conceptually more appealing, but it can't handle Thomas's
requirement of an unlimited number of files, so I think it makes sense
to go with the latter. The only problem with that which I can see is
that we might orphan some temporary files if the disk is flaky and
filesystem operations are failing intermittently, but that's already a
pretty bad situation which we're not going to make much worse with
this approach.

>> I want the
>> last process to detach to be responsible for cleanup, regardless of
>> which process that ends up being. I want that for lots of good
>> reasons which I have articulated, including (1) it's how all other
>> resource management for parallel query already works, e.g. DSM, DSA,
>> and group locking; (2) it avoids the need for one process to sit and
>> wait until another process assumes ownership, which isn't a feature
>> even if (as you contend, and I'm not convinced) it doesn't hurt much;
>> and (3) it allows for use cases where multiple processes are reading
>> from the same shared BufFile without the risk that some other process
>> will try to unlink() the file while it's still in use. The point for
>> me isn't so much whether unlink() ever ignores errors as whether
>> cleanup (however defined) is an operation guaranteed to happen exactly
>> once.
>
> My patch demonstrably has these properties. I've done quite a bit of
> fault injection testing to prove it.

I don't understand this comment, because 0 of the 3 properties that I
just articulated are things which can be proved or disproved by fault
injection.
Fault injection can confirm the presence of bugs or suggest their
absence, but none of those properties have to do with whether there
are bugs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
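The filesystem-discovery approach Robert favors can be sketched in a
few lines of portable C (illustrative only; the file-naming scheme
here is made up): probe numbered segment files upward until unlink()
reports ENOENT, treating a missing file as "already cleaned up" rather
than as an error, which is what makes the cleanup idempotent and
therefore safe to run from whichever backend exits last.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Delete "base.0", "base.1", ... without knowing the segment count in
 * advance.  ENOENT simply terminates the scan: either we ran off the
 * end of the series, or another process already removed that segment.
 */
static void
delete_segment_files(const char *base)
{
    char        path[1024];

    for (int segno = 0;; segno++)
    {
        snprintf(path, sizeof(path), "%s.%d", base, segno);
        if (unlink(path) < 0)
        {
            if (errno != ENOENT)
                fprintf(stderr, "could not unlink \"%s\"\n", path);
            break;
        }
    }
}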
On Wed, Mar 22, 2017 at 5:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Actually, that's quite possible with the design I came up with.
>
> I don't think it is. What sequence of calls to the APIs you've
> proposed would accomplish that goal? I don't see anything in this
> patch set that would permit anything other than a handoff from the
> worker to the leader. There seems to be no way for the ref count to be
> more than 1 (or 2?).

See my remarks on this below.

>> The
>> restriction that Thomas can't live with as I've left things is that
>> you have to know the number of BufFiles ahead of time. I'm pretty sure
>> that that's all it is. (I do sympathize with the fact that that isn't
>> very helpful to him, though.)
>
> I feel like there's some cognitive dissonance here. On the one hand,
> you're saying we should use your design.

No, I'm not. I'm saying that my design is complete on its own terms,
and has some important properties that a mechanism like this ought to
have. I think I've been pretty clear on my general uncertainty about
the broader question.

> On the other hand, you are
> admitting that in at least one key respect, it won't meet Thomas's
> requirements. On the third hand, you just said that you weren't
> arguing for two mechanisms for sharing a BufFile across cooperating
> parallel processes. I don't see how you can hold all three of those
> positions simultaneously.

I respect your position as the person that completely owns parallelism
here. You are correct when you say that there has to be some overlap
between the requirements for the mechanisms used by each patch --
there just *has* to be. As I said, I only know very approximately how
much overlap that is or should be, even at this late date, and I am
unfortunately not in a position to spend more time on it to find out.
C'est la vie.

I know that I have no chance of convincing you to adopt my design
here, and you are right not to accept the design, because there is a
bigger picture. And, because it's just too late now. My efforts to get
ahead of that, and anticipate and provide for Thomas' requirements,
have failed. I admit that.

But, you are asserting that my patch has specific technical defects
that it does not have. I structured things this way for a reason. You
are not required to agree with me in full to see that I might have had
a point. I've described it as a trade-off already. I think that it
will be of practical value to you to see that trade-off. This insight
is what allowed me to immediately zero in on resource leak bugs in
Thomas' revision of the patch from yesterday.

>> It is true that a worker resowner can unlink() the files
>> mid-unification, in the same manner as with conventional temp files,
>> and not decrement its refcount in shared memory, or care at all in any
>> special way. This is okay because the leader (in the case of parallel
>> tuplesort) will realize that it should not "turn out the lights",
>> finding that remaining reference when it calls BufFileClose() in its
>> registered callback, as it alone must. It doesn't matter that the
>> unlink() may have already occurred, or may be just about to occur,
>> because we are only operating on already-opened files, and never on
>> the link itself (we don't have to stat() the file link, for example,
>> which is naturally only a task for the unlink()'ing backend anyway).
>> You might say that the worker only blows away the link itself, not the
>> file proper, since it may still be open in the leader (say).
>
> Well, that sounds like it's counting on fd.c not to close the file
> descriptor at an inconvenient point in time and reopen it later, which
> is not guaranteed.

It's true that in an error path, if the FD of the file we just opened gets swapped out, that could happen. That seems virtually impossible, and in any case the consequence is no worse than a confusing LOG message. But, yes, that's a weakness.

>> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
>> since only the worker itself can reliably know how many segments it
>> has opened at every single instant in time. Because it's the owner!
>
> Above, you said that your design would allow for a group of processes
> to share access to a file, with the last one that abandons it "turning
> out the lights". But here, you are referring to it as having one
> owner - the "only the worker itself" can know the number of segments.
> Those things are exact opposites of each other.

You misunderstood. Under your analogy, the worker needs to wait for someone else to enter the room before leaving, because otherwise, as an "environmentally conscious" worker, it would be compelled to turn the lights out before anyone else ever got to do anything with its files. But once someone else is in the room, the worker is free to leave without turning out the lights. I could provide a mechanism for the leader, or whatever the other backend is, to do another handoff. You're right that that is left unimplemented, but it would be a trivial adjunct to what I came up with.

> I don't think there's any problem with ignoring ENOENT, and I don't
> think there's any need for a process to know the exact number of
> segments in some temporary file.

You may well be right, but that is just one detail.

>> My patch demonstrably has these properties. I've done quite a bit of
>> fault injection testing to prove it.
>
> I don't understand this comment, because 0 of the 3 properties that I
> just articulated are things which can be proved or disproved by fault
> injection. Fault injection can confirm the presence of bugs or
> suggest their absence, but none of those properties have to do with
> whether there are bugs.

I was unclear -- I just meant (3). Specifically, that resource ownership has been shown to be robust under stress testing/fault injection testing.

Anyway, I will provide some feedback on Thomas' latest revision from today, before I bow out. I owe him at least that much.

-- Peter Geoghegan
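To make the reference-counted "turn out the lights" contract described above concrete, here is a minimal sketch of such a close path. Every name in it is a hypothetical illustration for this thread, not the patch's actual API:

    #include "postgres.h"
    #include "storage/buffile.h"
    #include "storage/spin.h"

    typedef struct SharedBufFileState
    {
        slock_t     mutex;
        int         refcount;       /* backends still attached to the file */
    } SharedBufFileState;

    static void UnlinkSharedSegments(SharedBufFileState *state);  /* hypothetical */

    static void
    SharedBufFileRelease(SharedBufFileState *state, BufFile *file)
    {
        bool        lastone;

        SpinLockAcquire(&state->mutex);
        lastone = (--state->refcount == 0);
        SpinLockRelease(&state->mutex);

        /* Close our own descriptors; this never touches the on-disk link. */
        BufFileClose(file);

        /*
         * Only the last backend to detach "turns out the lights" by unlinking
         * the underlying segment files; earlier leavers just walk out of the
         * room and leave the lights on.
         */
        if (lastone)
            UnlinkSharedSegments(state);
    }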
On 2017-02-10 07:52:57 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
> > Up until two minutes ago I assumed that policy would leave only two
> > possibilities: you attach to the DSM segment and attach to the
> > SharedBufFileManager successfully or you attach to the DSM segment and
> > then die horribly (but not throw) and the postmaster restarts the
> > whole cluster and blows all temp files away with RemovePgTempFiles().
> > But I see now in the comment of that function that crash-induced
> > restarts don't call that because "someone might want to examine the
> > temp files for debugging purposes". Given that policy for regular
> > private BufFiles, I don't see why that shouldn't apply equally to
> > shared files: after a crash restart, you may have some junk files that
> > won't be cleaned up until your next clean restart, whether they were
> > private or shared BufFiles.
>
> I think most people (other than Tom) would agree that that policy
> isn't really sensible any more; it probably made sense when the
> PostgreSQL user community was much smaller and consisted mostly of the
> people developing PostgreSQL, but these days it's much more likely to
> cause operational headaches than to help a developer debug.

FWIW, we have restart_after_crash = false. If you need to debug things, you can enable that. Hence the whole RemovePgTempFiles() crash-restart exemption isn't required anymore; we have a much more targeted solution.

- Andres
On Wed, Mar 22, 2017 at 3:19 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
As per the earlier discussion in the thread, I experimented with using the
BufFileSet interface from the parallel-hash-v18 patch set. I took the other
parallel-hash patches as a reference to understand the BufFileSet APIs, and
incorporated the changes into parallel CREATE INDEX.
In order to achieve this:
- Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
parallel-hash-v18.patchset.
- Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
INDEX v10 patch.
- Incorporated the BufFileSet API into the parallel tuple sort for CREATE INDEX.
- Changed a few existing functions, and added a few new ones, to support the
BufFileSet changes.
To check the performance, I used a test similar to the one Peter posted
earlier in the thread:
Machine: power2 machine with 512GB of RAM
Setup:
CREATE TABLE parallel_sort_test AS
SELECT hashint8(i) randint,
md5(i::text) collate "C" padding1,
md5(i::text || '2') collate "C" padding2
FROM generate_series(0, 1e9::bigint) i;
vacuum ANALYZE parallel_sort_test;
postgres=# show max_parallel_workers_per_gather;
 max_parallel_workers_per_gather
---------------------------------
8
(1 row)
postgres=# show maintenance_work_mem;
maintenance_work_mem
----------------------
8GB
(1 row)
postgres=# show max_wal_size ;
max_wal_size
--------------
4GB
(1 row)
CREATE INDEX serial_idx ON parallel_sort_test (randint);
Without patch:
Time: 3430054.220 ms (57:10.054)
With patch (max_parallel_workers_maintenance = 8):
Time: 1163445.271 ms (19:23.445)
Thanks to my colleague Thomas Munro for his help and offline discussions
about the patch.
On Wed, Mar 22, 2017 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I disagree with that. It is a
>> trade-off, I suppose. I have now run out of time to work through it
>> with you or Thomas, though.
>
> Bummer.
I'm going to experiment with refactoring the v10 parallel CREATE INDEX
patch to use the SharedBufFileSet interface from
hj-shared-buf-file-v8.patch today and see what problems I run into.
Attaching v11 patch and trace_sort output for the test.
Thanks,
Rushabh Lathia
Attachment
On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > As per the earlier discussion in the thread, I did experiment using > BufFileSet interface from parallel-hash-v18.patchset. I took the reference > of parallel-hash other patches to understand the BufFileSet APIs, and > incorporate the changes to parallel create index. > > In order to achieve the same: > > - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and > 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the > parallel-hash-v18.patchset. > - Removed the buffile.c/logtap.c/fd.c changes from the parallel CREATE > INDEX v10 patch. > - incorporate the BufFileSet API to the parallel tuple sort for CREATE > INDEX. > - Changes into few existing functions as well as added few to support the > BufFileSet changes. I'm glad that somebody is working on this. (Someone closer to the more general work on shared/parallel BufFile infrastructure than I am.) I do have some quick feedback, and I hope to be able to provide that to both you and Thomas, as needed to see this one through. I'm not going to get into the tricky details around resource management just yet. I'll start with some simpler questions, to get a general sense of the plan here. I gather that you're at least aware that your v11 of the patch doesn't preserve randomAccess support for parallel sorts, because you didn't include my 0002-* testing GUCs patch, which was specifically designed to make various randomAccess stuff testable. I also figured this to be true because I noticed this FIXME among (otherwise unchanged) tuplesort code: > +static void > +leader_takeover_tapes(Tuplesortstate *state) > +{ > + Sharedsort *shared = state->shared; > + int nLaunched = state->nLaunched; > + int j; > + > + Assert(LEADER(state)); > + Assert(nLaunched >= 1); > + Assert(nLaunched == shared->workersFinished); > + > + /* > + * Create the tapeset from worker tapes, including a leader-owned tape at > + * the end. Parallel workers are far more expensive than logical tapes, > + * so the number of tapes allocated here should never be excessive. FIXME > + */ > + inittapestate(state, nLaunched + 1); > + state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes, > + state->fileset, state->worker); It's not surprising to me that you do not yet have this part working, because much of my design was about changing as little as possible above the BufFile interface, in order for tuplesort.c (and logtape.c) code like this to "just work" as if it was the serial case. It doesn't look like you've added the kind of BufFile multiplexing code that I expected to see in logtape.c. This is needed to compensate for the code removed from fd.c and buffile.c. Perhaps it would help me to go look at Thomas' latest parallel hash join patch -- did it gain some kind of transparent multiplexing ability that you actually (want to) use here? Though randomAccess isn't used by CREATE INDEX in general, and so not supporting randomAccess within tuplesort.c for parallel callers doesn't matter as far as this CREATE INDEX user-visible feature is concerned, I still believe that randomAccess is important (IIRC, Robert thought so too). Specifically, it seems like a good idea to have randomAccess support, both on general principle (why should the parallel case be different?), and because having it now will probably enable future enhancements to logtape.c. Enhancements that have it manage parallel sorts based on partitioning/distribution/bucketing [1]. 
I'm pretty sure that partitioning-based parallel sort is going to become very important in the future, especially for parallel GroupAggregate. The leader needs to truly own the tapes it reclaims from workers in order for all of this to work.

Questions on where you're going with randomAccess support:

1. Is randomAccess support a goal for you here at all?

2. If so, is preserving eager recycling of temp file space during randomAccess (materializing a final output tape within the leader) another goal for you here? Do we need to preserve that property of serial external sorts, too, so that it remains true that logtape.c ensures that "the total space usage is essentially just the actual data volume, plus insignificant bookkeeping and start/stop overhead"? (I'm quoting from master's logtape.c header comments.)

3. Any ideas on next steps in support of those 2 goals? What problems do you foresee, if any?

> CREATE INDEX serial_idx ON parallel_sort_test (randint);
>
> Without patch:
>
> Time: 3430054.220 ms (57:10.054)
>
> With patch (max_parallel_workers_maintenance = 8):
>
> Time: 1163445.271 ms (19:23.445)

This looks very similar to my v10. While I will need to follow up on this, to make sure, it seems likely that this patch has exactly the same performance characteristics as v10.

Thanks

[1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#Partitioning_for_parallelism_.28parallel_external_sort_beyond_CREATE_INDEX.29

-- Peter Geoghegan
On Wed, Sep 20, 2017 at 5:17 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> As per the earlier discussion in the thread, I experimented with using the
> BufFileSet interface from the parallel-hash-v18 patch set. I took the other
> parallel-hash patches as a reference to understand the BufFileSet APIs, and
> incorporated the changes into parallel CREATE INDEX.
>
> In order to achieve this:
>
> - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
> 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
> parallel-hash-v18.patchset.
> - Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
> INDEX v10 patch.
> - Incorporated the BufFileSet API into the parallel tuple sort for CREATE
> INDEX.
> - Changed a few existing functions, and added a few new ones, to support the
> BufFileSet changes.
I'm glad that somebody is working on this. (Someone closer to the more
general work on shared/parallel BufFile infrastructure than I am.)
I do have some quick feedback, and I hope to be able to provide that
to both you and Thomas, as needed to see this one through. I'm not
going to get into the tricky details around resource management just
yet. I'll start with some simpler questions, to get a general sense of
the plan here.
Thanks Peter.
I gather that you're at least aware that your v11 of the patch doesn't
preserve randomAccess support for parallel sorts, because you didn't
include my 0002-* testing GUCs patch, which was specifically designed
to make various randomAccess stuff testable. I also figured this to be
true because I noticed this FIXME among (otherwise unchanged)
tuplesort code:
Yes, I haven't touched the randomAccess part yet. My initial goal was
to incorporate the BufFileSet APIs here.
> +static void
> +leader_takeover_tapes(Tuplesortstate *state)
> +{
> + Sharedsort *shared = state->shared;
> + int nLaunched = state->nLaunched;
> + int j;
> +
> + Assert(LEADER(state));
> + Assert(nLaunched >= 1);
> + Assert(nLaunched == shared->workersFinished);
> +
> + /*
> + * Create the tapeset from worker tapes, including a leader-owned tape at
> + * the end. Parallel workers are far more expensive than logical tapes,
> + * so the number of tapes allocated here should never be excessive. FIXME
> + */
> + inittapestate(state, nLaunched + 1);
> + state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
> + state->fileset, state->worker);
It's not surprising to me that you do not yet have this part working,
because much of my design was about changing as little as possible
above the BufFile interface, in order for tuplesort.c (and logtape.c)
code like this to "just work" as if it was the serial case.
Right. I just followed your design from your earlier patches.
It doesn't
look like you've added the kind of BufFile multiplexing code that I
expected to see in logtape.c. This is needed to compensate for the
code removed from fd.c and buffile.c. Perhaps it would help me to go
look at Thomas' latest parallel hash join patch -- did it gain some
kind of transparent multiplexing ability that you actually (want to)
use here?
Sorry, I didn't get this part. Are you talking about your patch's changes
to OpenTemporaryFileInTablespace(), BufFileUnify() and the other changes
related to ltsUnify()? If that's the case, I don't think that's required with
the BufFileSet. Correct me if I am wrong here.
Though randomAccess isn't used by CREATE INDEX in general, and so not
supporting randomAccess within tuplesort.c for parallel callers
doesn't matter as far as this CREATE INDEX user-visible feature is
concerned, I still believe that randomAccess is important (IIRC,
Robert thought so too). Specifically, it seems like a good idea to
have randomAccess support, both on general principle (why should the
parallel case be different?), and because having it now will probably
enable future enhancements to logtape.c. Enhancements that have it
manage parallel sorts based on partitioning/distribution/bucketing
[1]. I'm pretty sure that partitioning-based parallel sort is going to
become very important in the future, especially for parallel
GroupAggregate. The leader needs to truly own the tapes it reclaims
from workers in order for all of this to work.
The first application for the tuplesort here is CREATE INDEX, and that doesn't
need randomAccess. But, as you said and as has been discussed in the thread,
randomAccess is important, and we should certainly put in the effort to
support it.
Questions on where you're going with randomAccess support:
1. Is randomAccess support a goal for you here at all?
2. If so, is preserving eager recycling of temp file space during
randomAccess (materializing a final output tape within the leader)
another goal for you here? Do we need to preserve that property of
serial external sorts, too, so that it remains true that logtape.c
ensures that "the total space usage is essentially just the actual
data volume, plus insignificant bookkeeping and start/stop overhead"?
(I'm quoting from master's logtape.c header comments.)
3. Any ideas on next steps in support of those 2 goals? What problems
do you foresee, if any?
To be frank, it's too early for me to comment on anything in this area. I need
to study this more closely. As an initial goal I was just focused on
understanding the current implementation of the patch and incorporating
the BufFileSet APIs.
> CREATE INDEX serial_idx ON parallel_sort_test (randint);
>
> Without patch:
>
> Time: 3430054.220 ms (57:10.054)
>
> With patch (max_parallel_workers_maintenance = 8):
>
> Time: 1163445.271 ms (19:23.445)
This looks very similar to my v10. While I will need to follow up on
this, to make sure, it seems likely that this patch has exactly the
same performance characteristics as v10.
It's 2.96x, more or less similar to your v10. Any difference might be due
to the different testing environment.
Thanks
[1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#Partitioning_for_parallelism_.28parallel_external_sort_beyond_CREATE_INDEX.29
--
Peter Geoghegan
Thanks,
Rushabh Lathia
On Wed, Sep 20, 2017 at 5:32 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> The first application for the tuplesort here is CREATE INDEX, and that doesn't
> need randomAccess. But, as you said and as has been discussed in the thread,
> randomAccess is important, and we should certainly put in the effort to
> support it.

There's no direct benefit of working on randomAccess support unless we have some code that wants to use that support for something. Indeed, it would just leave us with code we couldn't test. While I do agree that there are probably use cases for randomAccess, I think what we should do right now is try to get this patch reviewed and committed so that we have parallel CREATE INDEX for btree indexes. And in so doing, let's keep it as simple as possible. Parallel CREATE INDEX for btree indexes is a great feature without adding any more complexity.

Later, anybody who wants to work on randomAccess support -- and whatever planner and executor changes are needed to make effective use of it -- can do so. For example, one can imagine a plan like this:

Gather
-> Merge Join
   -> Parallel Index Scan
   -> Parallel Sort
      -> Parallel Seq Scan

If the parallel sort reads out all of the output in every worker, then it becomes legal to do this kind of thing -- it would end up, I think, being quite similar to Parallel Hash. However, there's some question in my mind as to whether we want to do this or, say, hash-partition both relations and then perform separate joins on each partition. The above plan is clearly better than what we can do today, where every worker would have to repeat the sort, ugh, but I don't know if it's the best plan. Fortunately, to get this patch committed, we don't have to figure that out.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Yes, I haven't touched the randomAccess part yet. My initial goal was
> to incorporate the BufFileSet APIs here.

This is going to need a rebase, due to the commit today to remove replacement selection sort. That much should be easy.

> Sorry, I didn't get this part. Are you talking about your patch's changes
> to OpenTemporaryFileInTablespace(), BufFileUnify() and the other changes
> related to ltsUnify()? If that's the case, I don't think that's required with
> the BufFileSet. Correct me if I am wrong here.

I thought that you'd have multiple BufFiles, which would be multiplexed (much like a single BufFile itself multiplexes 1GB segments), so that logtape.c could still recycle space in the randomAccess case. I guess that that's not a goal now.

> To be frank, it's too early for me to comment on anything in this area. I need
> to study this more closely. As an initial goal I was just focused on
> understanding the current implementation of the patch and incorporating
> the BufFileSet APIs.

Fair enough.

-- Peter Geoghegan
On Sat, Sep 30, 2017 at 5:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Yes, I haven't touched the randomAccess part yet. My initial goal was
> to incorporate the BufFileSet APIs here.
This is going to need a rebase, due to the commit today to remove
replacement selection sort. That much should be easy.
Sorry for the delay; here is the rebased version of the patch.
> Sorry, I didn't get this part. Are you talking about your patch's changes
> to OpenTemporaryFileInTablespace(), BufFileUnify() and the other changes
> related to ltsUnify()? If that's the case, I don't think that's required with
> the BufFileSet. Correct me if I am wrong here.
I thought that you'd have multiple BufFiles, which would be
multiplexed (much like a single BufFile itself multiplexes 1GB
segments), so that logtape.c could still recycle space in the
randomAccess case. I guess that that's not a goal now.
Hmm okay.
> To be frank, it's too early for me to comment on anything in this area. I need
> to study this more closely. As an initial goal I was just focused on
> understanding the current implementation of the patch and incorporating
> the BufFileSet APIs.
Fair enough.
Thanks,
--
Rushabh Lathia
Attachment
Attaching the rebased patch according to the v22 parallel-hash patch sets.
Thanks,
--
Rushabh Lathia
Attachment
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Attaching the rebased patch according to the v22 parallel-hash patch sets.

I took a quick look at this today, and noticed a few issues:

* make_name() is used to name files in sharedtuplestore.c, which is what is passed to BufFileOpenShared() for parallel hash join. You're using your own logic for that within the equivalent logtape.c call to BufFileOpenShared(), presumably because make_name() wants to identify participants by PID rather than by an ordinal identifier number.

I think that we need some kind of central registry for things that use shared buffiles. It could be that sharedtuplestore.c is further generalized to support this, or it could be that they both call something else that takes care of naming. It's not okay to have this left to random chance.

You're going to have to ask Thomas about this. You should also use MAXPGPATH for the char buffer on the stack.

* This logtape.c comment needs to be updated, as it's no longer true:

 * successfully. In general, workers can take it that the leader will
 * reclaim space in files under their ownership, and so should not
 * reread from tape.

* Robert hated the comment changes in the header of nbtsort.c. You might want to change it back, because he is likely to be the one that commits this.

* You should look for similar comments in tuplesort.c (IIRC a couple of places will need to be revised).

* tuplesort_begin_common() should actively reject a randomAccess parallel case using elog(ERROR).

* tuplesort.h should note that randomAccess isn't supported, too.

* What's this all about?:

+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared) \
+ ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

You can't just cast from one type to the other without regard for the underlying size of the shared memory buffer, which is what this looks like to me. This only fails to crash because you're only abusing the last member in the tapes array for this purpose, and there happens to be enough shared memory slop that you get away with it. I'm pretty sure that ltsUnify() ends up clobbering the last/leader tape, which is a place where BufFileSet is also used, so this is just wrong. You should rethink the shmem structure a little bit.

* There is still that FIXME comment within leader_takeover_tapes(). I believe that you should still have a leader tape (at least in local memory in the leader), even though you'll never be able to do anything with it, since randomAccess is no longer supported. You can remove the FIXME, and just note that you have a leader tape to be consistent with the serial case, though recognize that it's not useful. Note that even with randomAccess, we always had the leader tape, so it's not that different, really.

I suppose it might make sense to make shared->tapes not have a leader tape. It hardly matters -- perhaps you should leave it there in order to keep the code simple, as you'll be keeping the leader tape in local memory, too. (But it still won't fly to continue to clobber it, of course -- you still need to find a dedicated place for BufFileSet in shared memory.)

That's all I have right now.

-- Peter Geoghegan
On Wed, Nov 1, 2017 at 11:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
>> Attaching the rebased patch according to the v22 parallel-hash patch sets.
>
> I took a quick look at this today, and noticed a few issues:
>
> * make_name() is used to name files in sharedtuplestore.c, which is
> what is passed to BufFileOpenShared() for parallel hash join. You're
> using your own logic for that within the equivalent logtape.c call to
> BufFileOpenShared(), presumably because make_name() wants to identify
> participants by PID rather than by an ordinal identifier number.

So that's this bit:

+ pg_itoa(worker, filename);
+ lts->pfile = BufFileCreateShared(fileset, filename);

... and:

+ pg_itoa(i, filename);
+ file = BufFileOpenShared(fileset, filename);

What's wrong with using a worker number like this?

> I think that we need some kind of central registry for things that use
> shared buffiles. It could be that sharedtuplestore.c is further
> generalized to support this, or it could be that they both call
> something else that takes care of naming. It's not okay to have this
> left to random chance.

It's not random choice: buffile.c creates a uniquely named directory (or directories, if you have more than one location configured in the temp_tablespaces GUC) to hold all the backing files involved in each BufFileSet. Naming of BufFiles within the BufFileSet is the caller's problem, and a worker number seems like a reasonable choice to me. It won't collide with a concurrent parallel CREATE INDEX because that'll be using its own BufFileSet.

> You're going to have to ask Thomas about this. You should also use
> MAXPGPATH for the char buffer on the stack.

Here's a summary of the namespace management scheme I currently have at the three layers fd.c, buffile.c, and sharedtuplestore.c:

1. fd.c has new lower-level functions PathNameCreateTemporaryFile(const char *path) and PathNameOpenTemporaryFile(const char *path). It also provides PathNameCreateTemporaryDir(). Clearly callers of these interfaces will need to be very careful about managing the names they use. Callers also own the problem of cleaning up files, since there is no automatic cleanup of files created this way. My intention was that these facilities would *only* be used by BufFileSet, since it has machinery to manage those things.

2. buffile.c introduces BufFileSet, which is conceptually a set of BufFiles that can be shared by multiple backends with DSM segment-scoped cleanup. It is implemented as a set of directories: one for each tablespace in temp_tablespaces. It controls the naming of those directories. The BufFileSet directories are named similarly to fd.c's traditional temporary file names using the usual recipe "pgsql_tmp" + PID + per-process counter but have an additional ".set" suffix. RemovePgTempFilesInDir() recognises directories with that prefix and suffix as junk left over from a crash when cleaning up. I suppose it's that knowledge about reserved name patterns and cleanup that you are thinking of as a central registry?

As for the BufFiles that are in a BufFileSet, buffile.c has no opinion on that: the calling code (parallel CREATE INDEX, sharedtuplestore.c, ...) is responsible for coming up with its own scheme. If parallel CREATE INDEX wants to name shared BufFiles "walrus" and "banana", that's OK by me, and those files won't collide with anything in another BufFileSet because each BufFileSet has its own directory (-ies).
One complaint about the current coding that someone might object to: MakeSharedSegmentPath() just dumps the caller's BufFile name into a path without sanitisation; I should fix that so that we only accept fairly limited strings here. Another complaint is that perhaps fd.c knows too much about buffile.c's business. For example, RemovePgTempFilesInDir() knows about the ".set" directories created by buffile.c, which might be called a layering violation. Perhaps the set/directory logic should move entirely into fd.c, so you'd call FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then BufFileOpenShared() would take a FileSet *, not a BufFileSet *. Thoughts?

3. sharedtuplestore.c takes a caller-supplied BufFileSet and creates its shared BufFiles in there. Earlier versions created and owned a BufFileSet, but in the current Parallel Hash patch I create loads of separate SharedTuplestore objects, but I didn't want to create loads of directories to back them, so you can give them all the same BufFileSet. That works because SharedTuplestores are also given a name, and it's the caller's job (in my case nodeHash.c) to make sure the SharedTuplestores are given unique names within the same BufFileSet. For Parallel Hash you'll see names like 'i3of8' (inner batch 3 of 8). There is no need for any sort of central registry for that though, because it rides on top of the guarantees from 2 above: buffile.c will put those files into a uniquely named directory, and that works as long as no one else is allowed to create files or directories in the temp directory that collide with its reserved pattern /^pgsql_tmp.+\.set$/. For the same reason, parallel CREATE INDEX is free to use worker numbers as BufFile names, since it has its own BufFileSet to work within.

> * What's this all about?:
>
> + /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
> + #define GetSharedBufFileSet(shared) \
> + ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

In an earlier version, BufFileSet was one of those annoying data structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an incomplete type (declared but not defined in the includable header), and here it was being used "inside" (or rather after) SharedSort, which *itself* had a FLEXIBLE_ARRAY_MEMBER. The reason for the variable sized object was that I needed all backends to agree on the set of temporary tablespace OIDs, of which there could be any number, but I also needed a 'flat' (pointer-free) object I could stick in relocatable shared memory. In the newest version I changed that flexible array to tablespaces[8], because 8 should be enough tablespaces for anyone (TM). I don't really believe anyone uses temp_tablespaces for IO load balancing anymore and I hate code like the above. So I think Rushabh should now remove the above-quoted code and just use a BufFileSet directly as a member of SharedSort.

-- Thomas Munro http://www.enterprisedb.com
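To make the reserved-name recipe above concrete, here is a sketch of how such a directory name could be built ("pgsql_tmp" + PID + per-process counter + ".set"). The helper name and exact format string are hypothetical; the actual patch may format things differently:

    #include "postgres.h"
    #include "miscadmin.h"      /* MyProcPid */
    #include "storage/fd.h"     /* PG_TEMP_FILE_PREFIX, i.e. "pgsql_tmp" */

    /*
     * Sketch only: build the name of a BufFileSet directory inside a given
     * tablespace's temporary directory.  "tempdirpath" is assumed to already
     * name the pgsql_tmp directory for the tablespace.
     */
    static void
    MakeSetDirName(char *path, size_t pathlen, const char *tempdirpath)
    {
        static uint32 counter = 0;

        /* e.g. ".../pgsql_tmp/pgsql_tmp12345.0.set" */
        snprintf(path, pathlen, "%s/%s%d.%u.set",
                 tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, counter++);
    }

RemovePgTempFilesInDir() can then treat anything matching the reserved pattern /^pgsql_tmp.+\.set$/ as crash leftovers, exactly as described above.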
On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> So that's this bit:
>
> + pg_itoa(worker, filename);
> + lts->pfile = BufFileCreateShared(fileset, filename);
>
> ... and:
>
> + pg_itoa(i, filename);
> + file = BufFileOpenShared(fileset, filename);

Right.

> What's wrong with using a worker number like this?

I guess nothing, though there is the question of discoverability for DBAs, etc. You do address this separately, by having (potentially) descriptive filenames, as you go into.

> It's not random choice: buffile.c creates a uniquely named directory
> (or directories, if you have more than one location configured in the
> temp_tablespaces GUC) to hold all the backing files involved in each
> BufFileSet. Naming of BufFiles within the BufFileSet is the caller's
> problem, and a worker number seems like a reasonable choice to me. It
> won't collide with a concurrent parallel CREATE INDEX because that'll
> be using its own BufFileSet.

Oh, I see. I may have jumped the gun on that one.

> One complaint about the current coding that someone might object to:
> MakeSharedSegmentPath() just dumps the caller's BufFile name into a
> path without sanitisation; I should fix that so that we only accept
> fairly limited strings here. Another complaint is that perhaps fd.c
> knows too much about buffile.c's business. For example,
> RemovePgTempFilesInDir() knows about the ".set" directories created by
> buffile.c, which might be called a layering violation. Perhaps the
> set/directory logic should move entirely into fd.c, so you'd call
> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
> Thoughts?

I'm going to make an item on my personal TODO list for that. No useful insights on that right now, though.

> 3. sharedtuplestore.c takes a caller-supplied BufFileSet and creates
> its shared BufFiles in there. Earlier versions created and owned a
> BufFileSet, but in the current Parallel Hash patch I create loads of
> separate SharedTuplestore objects, but I didn't want to create loads
> of directories to back them, so you can give them all the same
> BufFileSet. That works because SharedTuplestores are also given a
> name, and it's the caller's job (in my case nodeHash.c) to make sure
> the SharedTuplestores are given unique names within the same
> BufFileSet. For Parallel Hash you'll see names like 'i3of8' (inner
> batch 3 of 8). There is no need for any sort of central registry for
> that though, because it rides on top of the guarantees from 2 above:
> buffile.c will put those files into a uniquely named directory, and
> that works as long as no one else is allowed to create files or
> directories in the temp directory that collide with its reserved
> pattern /^pgsql_tmp.+\.set$/. For the same reason, parallel CREATE
> INDEX is free to use worker numbers as BufFile names, since it has
> its own BufFileSet to work within.

If the new standard is that you have temp file names that suggest the purpose of each temp file, then that may be something that parallel CREATE INDEX should buy into.

> In an earlier version, BufFileSet was one of those annoying data
> structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an
> incomplete type (declared but not defined in the includable header),
> and here it was being used "inside" (or rather after) SharedSort,
> which *itself* had a FLEXIBLE_ARRAY_MEMBER.
> The reason for the variable sized object was that I needed all
> backends to agree on the set of temporary tablespace OIDs, of which
> there could be any number, but I also needed a 'flat' (pointer-free)
> object I could stick in relocatable shared memory. In the newest
> version I changed that flexible array to tablespaces[8], because 8
> should be enough tablespaces for anyone (TM).

I guess that that's something that you'll need to take up with Andres, if you haven't already. I have a hard time imagining a single query needing to use more than that many tablespaces at once, so maybe this is fine.

> I don't really believe anyone uses
> temp_tablespaces for IO load balancing anymore and I hate code like
> the above. So I think Rushabh should now remove the above-quoted code
> and just use a BufFileSet directly as a member of SharedSort.

FWIW, I agree with you that nobody uses temp_tablespaces this way these days. This seems like a discussion for your hash join patch, though. I'm happy to buy into that.

-- Peter Geoghegan
On Wed, Nov 1, 2017 at 2:11 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Another complaint is that perhaps fd.c
>> knows too much about buffile.c's business. For example,
>> RemovePgTempFilesInDir() knows about the ".set" directories created by
>> buffile.c, which might be called a layering violation. Perhaps the
>> set/directory logic should move entirely into fd.c, so you'd call
>> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
>> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
>> Thoughts?
>
> I'm going to make an item on my personal TODO list for that. No useful
> insights on that right now, though.

I decided to try that, but it didn't really work: fd.h gets included by front-end code, so I can't very well define a struct and declare functions that deal in dsm_segment and slock_t. On the other hand it does seem a bit better for these shared file sets to work in terms of File, not BufFile. That way you don't have to opt in to BufFile's double buffering and segmentation schemes just to get shared file clean-up, if for some reason you want direct file handles. So in the v24 parallel hash patch set I just posted over in the other thread, I have moved it into its own translation unit sharedfileset.c and made it work with File objects. buffile.c knows how to use it as a source of segment files. I think that's better.

> If the new standard is that you have temp file names that suggest the
> purpose of each temp file, then that may be something that parallel
> CREATE INDEX should buy into.

Yeah, I guess that could be useful.

-- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> I'm going to make an item on my personal TODO list for that. No useful
>> insights on that right now, though.
>
> I decided to try that, but it didn't really work: fd.h gets included
> by front-end code, so I can't very well define a struct and declare
> functions that deal in dsm_segment and slock_t. On the other hand it
> does seem a bit better for these shared file sets to work in terms
> of File, not BufFile.

Realistically, fd.h has a number of functions that are really owned by buffile.c already. This sounds fine.

> That way you don't have to opt in to BufFile's
> double buffering and segmentation schemes just to get shared file
> clean-up, if for some reason you want direct file handles.

Is that something that you really think is possible?

-- Peter Geoghegan
On Fri, Nov 3, 2017 at 2:24 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> That way you don't have to opt in to BufFile's
>> double buffering and segmentation schemes just to get shared file
>> clean-up, if for some reason you want direct file handles.
>
> Is that something that you really think is possible?

It's pretty far-fetched, but maybe shared temporary relation files accessed via smgr.c/md.c? Or maybe future things that don't want to read/write through a buffer but instead want to mmap it.

-- Thomas Munro http://www.enterprisedb.com
Thanks Peter and Thomas for the review comments.
On Wed, Nov 1, 2017 at 3:59 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Attaching the re based patch according to the v22 parallel-hash patch sets
I took a quick look at this today, and noticed a few issues:
* make_name() is used to name files in sharedtuplestore.c, which is
what is passed to BufFileOpenShared() for parallel hash join. You're
using your own logic for that within the equivalent logtape.c call to
BufFileOpenShared(), presumably because make_name() wants to identify
participants by PID rather than by an ordinal identifier number.
I think that we need some kind of central registry for things that use
shared buffiles. It could be that sharedtuplestore.c is further
generalized to support this, or it could be that they both call
something else that takes care of naming. It's not okay to have this
left to random chance.
You're going to have to ask Thomas about this. You should also use
MAXPGPATH for the char buffer on the stack.
Used MAXPGPATH for the char buffer.
* This logtape.c comment needs to be updated, as it's no longer true:
* successfully. In general, workers can take it that the leader will
* reclaim space in files under their ownership, and so should not
* reread from tape.
Done.
* Robert hated the comment changes in the header of nbtsort.c. You
might want to change it back, because he is likely to be the one that
commits this.
* You should look for similar comments in tuplesort.c (IIRC a couple
of places will need to be revised).
Pending.
* tuplesort_begin_common() should actively reject a randomAccess
parallel case using elog(ERROR).
Done.
* tuplesort.h should note that randomAccess isn't supported, too.
Done.
* What's this all about?:
+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared) \
+ ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))
You can't just cast from one type to the other without regard for the
underlying size of the shared memory buffer, which is what this looks
like to me. This only fails to crash because you're only abusing the
last member in the tapes array for this purpose, and there happens to
be enough shared memory slop that you get away with it. I'm pretty
sure that ltsUnify() ends up clobbering the last/leader tape, which is
a place where BufFileSet is also used, so this is just wrong. You
should rethink the shmem structure a little bit.
Fixed this by adding a SharedFileSet directly into the Sharedsort struct.
Thanks Thomas Munro for the offline help here.
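For reference, a sketch of the resulting shared memory struct, with the SharedFileSet as an ordinary fixed-size member rather than something cast out of the tail of the tapes array. The fields other than fileset and tapes are abridged, and member types are only assumed here:

    typedef struct Sharedsort
    {
        /* ... mutex, workersFinished, and other coordination state ... */
        SharedFileSet fileset;      /* space for temp files -- fixed size */
        int         nTapes;
        /* the variable-length array must remain the last member */
        TapeShare   tapes[FLEXIBLE_ARRAY_MEMBER];
    } Sharedsort;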
* There is still that FIXME comment within leader_takeover_tapes(). I
believe that you should still have a leader tape (at least in local
memory in the leader), even though you'll never be able to do anything
with it, since randomAccess is no longer supported. You can remove the
FIXME, and just note that you have a leader tape to be consistent with
the serial case, though recognize that it's not useful. Note that even
with randomAccess, we always had the leader tape, so it's not that
different, really.
Done.
I suppose it might make sense to make shared->tapes not have a leader
tape. It hardly matters -- perhaps you should leave it there in order
to keep the code simple, as you'll be keeping the leader tape in local
memory, too. (But it still won't fly to continue to clobber it, of
course -- you still need to find a dedicated place for BufFileSet in
shared memory.)
Attaching the latest patch (v13) here. I will continue working on the comment
improvements for nbtsort.c and tuplesort.c, and will also perform more testing
with the attached patch.
The patch is a rebase onto the v25 patch set of Parallel Hash.
Thanks,
Rushabh Lathia
Attachment
On Tue, Nov 14, 2017 at 1:41 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > Thanks Peter and Thomas for the review comments. No problem. More feedback: * I don't really see much need for this: + elog(LOG, "Worker for create index %d", parallel_workers); You can just use trace_sort, and observe the actual behavior of the sort that way. * As I said before, you should remove the header comments within nbtsort.c. * This should just say "write routines": + * This is why write/recycle routines don't need to know about offsets at + * all. * You didn't point out the randomAccess restriction in tuplesort.h. * I can't remember why I added the Valgrind suppression at this point. I'd remove it until the reason becomes clear, which may never happen. The regression tests should still pass without Valgrind warnings. * You can add back comments removed from above LogicalTapeTell(). I made these changes because it looked like we should close out the possibility of doing a tell during the write phase, as unified tapes actually would make that hard (no one does what it describes anyway). But now, unified tapes are a distinct case to frozen tapes in a way that they weren't before, so there is no need to make it impossible. I also think you should replace "Assert(lt->frozen)" with "Assert(lt->offsetBlockNumber == 0L)", for the same reason. -- Peter Geoghegan
On Tue, Nov 14, 2017 at 10:01 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Nov 14, 2017 at 1:41 AM, Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
>> Thanks Peter and Thomas for the review comments.
>
> No problem. More feedback:

I see that Robert just committed support for a parallel_leader_participation GUC. Parallel tuplesort should use this, too.

It will be easy to adapt the patch to make this work. Just change the code within nbtsort.c to respect parallel_leader_participation, rather than leaving that as a compile-time switch. Remove the force_single_worker variable, and use !parallel_leader_participation in its place.

The parallel_leader_participation docs will also need to be updated.

-- Peter Geoghegan
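A sketch of what that nbtsort.c change might look like, with all variable names merely illustrative rather than taken from the patch:

    /* Read the GUC once; honor it in place of the old compile-time switch. */
    bool        leaderparticipates = parallel_leader_participation;

    /* The leader counts as a sort participant only when it opts in. */
    btshared->scantuplesortstates =
        leaderparticipates ? nworkers_launched + 1 : nworkers_launched;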
On Tue, Nov 14, 2017 at 11:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Nov 14, 2017 at 1:41 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> Thanks Peter and Thomas for the review comments.
No problem. More feedback:
* I don't really see much need for this:
+ elog(LOG, "Worker for create index %d", parallel_workers);
You can just use trace_sort, and observe the actual behavior of the
sort that way.
Right, that was just added for testing purposes. Removed in the
latest version of the patch.
* As I said before, you should remove the header comments within nbtsort.c.
Done.
* This should just say "write routines":
+ * This is why write/recycle routines don't need to know about offsets at
+ * all.
Okay, done.
* You didn't point out the randomAccess restriction in tuplesort.h.
I did, it's there in the file header comments.
* I can't remember why I added the Valgrind suppression at this point.
I'd remove it until the reason becomes clear, which may never happen.
The regression tests should still pass without Valgrind warnings.
Make sense.
* You can add back comments removed from above LogicalTapeTell(). I
made these changes because it looked like we should close out the
possibility of doing a tell during the write phase, as unified tapes
actually would make that hard (no one does what it describes anyway).
But now, unified tapes are a distinct case to frozen tapes in a way
that they weren't before, so there is no need to make it impossible.
I also think you should replace "Assert(lt->frozen)" with
"Assert(lt->offsetBlockNumber == 0L)", for the same reason.
Yep, done.
I see that Robert just committed support for a
parallel_leader_participation GUC. Parallel tuplesort should use this,
too.
It will be easy to adapt the patch to make this work. Just change the
code within nbtsort.c to respect parallel_leader_participation, rather
than leaving that as a compile-time switch. Remove the
force_single_worker variable, and use !parallel_leader_participation
in its place.
Added handling for parallel_leader_participation, and deleted the
compile-time option force_single_worker.
The parallel_leader_participation docs will also need to be updated.
Done.
Also performed more testing with the patch, with parallel_leader_participation
ON and OFF. Found one issue: earlier we always used to call
_bt_leader_sort_as_worker(), but now we need to skip the call if
parallel_leader_participation is OFF.
Also fixed the documentation, including a documentation compilation error.
PFA v14 patch.
...
...
Thanks,
Rushabh Lathia
Attachment
On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> 0001-Add-parallel-B-tree-index-build-sorting_v14.patch

Cool. I'm glad that we now have a patch that applies cleanly against master, while adding very little to buffile.c. It feels like we're getting very close here.

>> * You didn't point out the randomAccess restriction in tuplesort.h.
>
> I did, it's there in the file header comments.

I see what you wrote in tuplesort.h here:

> + * algorithm, and are typically only used for large amounts of data. Note
> + * that parallel sorts is not support for random access to the sort result.

This should say "...are not supported when random access is requested".

> Added handling for parallel_leader_participation, and deleted the
> compile-time option force_single_worker.

I still see this:

> +
> +/*
> + * A parallel sort with one worker process, and without any leader-as-worker
> + * state may be used for testing the parallel tuplesort infrastructure.
> + */
> +#ifdef NOT_USED
> +#define FORCE_SINGLE_WORKER
> +#endif

Looks like you missed this FORCE_SINGLE_WORKER hunk -- please remove it, too.

>> The parallel_leader_participation docs will also need to be updated.
>
> Done.

I don't see this. There is no reference to parallel_leader_participation in the CREATE INDEX docs, nor is there a reference to CREATE INDEX in the parallel_leader_participation docs.

> Also performed more testing with the patch, with parallel_leader_participation
> ON and OFF. Found one issue: earlier we always used to call
> _bt_leader_sort_as_worker(), but now we need to skip the call if
> parallel_leader_participation is OFF.

Hmm. I think the local variable within _bt_heapscan() should go back. Its value should be directly taken from parallel_leader_participation assignment, once. There might be some bizarre circumstances where it is possible for the value of parallel_leader_participation to change in flight, causing a race condition: we start with the leader as a participant, and change our mind later within _bt_leader_sort_as_worker(), causing the whole CREATE INDEX to hang forever. Even if that's impossible, it seems like an improvement in style to go back to one local variable controlling everything.

Style issue here:

> + long start_block = file->numFiles * BUFFILE_SEG_SIZE;
> + int newNumFiles = file->numFiles + source->numFiles;

Shouldn't start_block conform to the surrounding camelCase style?

Finally, two new thoughts on the patch, that are not responses to anything you did in v14:

1. Thomas' barrier abstraction was added by commit 1145acc7. I think that you should use a static barrier in tuplesort.c now, and rip out the ConditionVariable fields in the Sharedsort struct. It's only a slightly higher level of abstraction for tuplesort.c, which makes only a small difference given the simple requirements of tuplesort.c. However, I see no reason to not go that way if that's the new standard, which it is. This looks like it will be fairly easy.

2. Does the plan_create_index_workers() cost model need to account for parallel_leader_participation, too, when capping workers? I think that it does. The relevant planner code is:

> + /*
> + * Cap workers based on available maintenance_work_mem as needed.
> + *
> + * Note that each tuplesort participant receives an even share of the
> + * total maintenance_work_mem budget. Aim to leave workers (where
> + * leader-as-worker Tuplesortstate counts as a worker) with no less than
> + * 32MB of memory. This leaves cases where maintenance_work_mem is set to
> + * 64MB immediately past the threshold of being capable of launching a
> + * single parallel worker to sort.
> + */
> + sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
> + min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
> + while (parallel_workers > min_parallel_workers &&
> + sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
> + parallel_workers--;

This parallel CREATE INDEX planner code snippet is about the need to have low per-worker maintenance_work_mem availability prevent more parallel workers from being added to the number that we plan to launch. Each worker tuplesort state needs at least 32MB. We clearly need to do something here.

While it's always true that "leader-as-worker Tuplesortstate counts as a worker" in v14, I think that it should only be true in the next revision of the patch when parallel_leader_participation is actually true (IOW, we should only add 1 to parallel_workers within the loop invariant in that case). The reason why we need to consider parallel_leader_participation within this plan_create_index_workers() code is simple: During execution, _bt_leader_sort_as_worker() uses "worker tuplesort states"/btshared->scantuplesortstates to determine how much of a share of maintenance_work_mem each worker tuplesort gets. Our planner code needs to take that into account, now that the nbtsort.c parallel_leader_participation behavior isn't just some obscure debug option. IOW, the planner code needs to be consistent with the nbtsort.c execution code.

-- Peter Geoghegan
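Concretely, the adjustment being asked for amounts to something like this sketch of the quoted loop, counting the leader as a memory-consuming participant only when it will actually participate:

    sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
    min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
    while (parallel_workers > min_parallel_workers &&
           sort_mem_blocks / (parallel_workers +
                              (parallel_leader_participation ? 1 : 0)) <
           min_sort_mem_blocks)
        parallel_workers--;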
On Fri, Dec 8, 2017 at 1:57 PM, Peter Geoghegan <pg@bowt.ie> wrote: > 1. Thomas' barrier abstraction was added by commit 1145acc7. I think > that you should use a static barrier in tuplesort.c now, and rip out > the ConditionVariable fields in the Sharedsort struct. It's only a > slightly higher level of abstraction for tuplesort.c, which makes only > a small difference given the simple requirements of tuplesort.c. > However, I see no reason to not go that way if that's the new > standard, which it is. This looks like it will be fairly easy. I thought about this too. A static barrier seems ideal for it, except for one tiny detail. We'd initialise the barrier with the number of participants, and then after launching we get to find out how many workers were really launched using pcxt->nworkers_launched, which may be a smaller number. If it's a smaller number, we need to adjust the barrier to the smaller party size. We can't do that by calling BarrierDetach() n times, because Andres convinced me to assert that you didn't try to detach from a static barrier (entirely reasonably) and I don't really want a process to be 'detaching' on behalf of someone else anyway. So I think we'd need to add an extra barrier function that lets you change the party size of a static barrier. Yeah, that sounds like a contradiction... but it's not the same as the attach/detach workflow because static parties *start out attached*, which is a very important distinction (it means that client code doesn't have to futz about with phases, or in other words the worker doesn't have to consider the possibility that it started up late and missed all the action and the sort is finished). The tidiest way to provide this new API would, I think, be to change the internal function BarrierDetachImpl() to take a parameter n and reduce barrier->participants by that number, and then add a function BarrierForgetParticipants(barrier, n) [insert better name] and have it call BarrierDetachImpl(). Then the latter's assertion that !static_party could move out to BarrierDetach() and BarrierArriveAndDetach(). Alternatively, we could use the dynamic API (see earlier parentheses about phases). The end goal would be that code like this can use BarrierInit(&barrier, participants), then (if necessary) BarrierForgetParticipants(&barrier, nonstarters), and then they all just have to call BarrierArriveAndWait() at the right time and that's all. Nice and tidy. -- Thomas Munro http://www.enterprisedb.com
On Fri, Dec 8, 2017 at 2:23 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Dec 8, 2017 at 1:57 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think >> that you should use a static barrier in tuplesort.c now, and rip out >> the ConditionVariable fields in the Sharedsort struct. > > ... So I think we'd need to add an extra barrier > function that lets you change the party size of a static barrier. Something like the attached (untested), which would allow _bt_begin_parallel() to call BarrierInit(&barrier, request + 1), then BarrierForgetParticipants(&barrier, request - pcxt->nworkers_launched), and then all the condition variable loop stuff can be replaced with a well placed call to BarrierArriveAndWait(&barrier, WAIT_EVENT_SOMETHING_SOMETHING). -- Thomas Munro http://www.enterprisedb.com
Attachment
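As a simplified picture of what the proposed BarrierForgetParticipants() would do, consider this standalone toy model. Everything here other than the proposed semantics -- reducing the party size of a static barrier before anyone arrives -- is a stand-in invented for illustration, not barrier.c code:

#include <assert.h>
#include <stdio.h>
#include <stdbool.h>

typedef struct ToyBarrier
{
    int  participants;    /* current party size */
    bool static_party;    /* initialized with a fixed party? */
} ToyBarrier;

static void
ToyBarrierInit(ToyBarrier *barrier, int participants)
{
    barrier->participants = participants;
    barrier->static_party = true;
}

/*
 * Reduce the party size to account for workers that were requested but
 * never launched.  Unlike detaching, this is fine for a static party,
 * because the non-starters never attached in the first place.
 */
static void
ToyBarrierForgetParticipants(ToyBarrier *barrier, int n)
{
    assert(barrier->static_party);
    assert(n >= 0 && n < barrier->participants);
    barrier->participants -= n;
}

int
main(void)
{
    ToyBarrier barrier;
    int        request = 7;            /* workers requested */
    int        nworkers_launched = 5;  /* workers actually launched */

    /* the leader counts as a participant, hence request + 1 */
    ToyBarrierInit(&barrier, request + 1);
    ToyBarrierForgetParticipants(&barrier, request - nworkers_launched);
    printf("party size: %d\n", barrier.participants);   /* prints 6 */
    return 0;
}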
Thanks for review.
On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> 0001-Add-parallel-B-tree-index-build-sorting_v14.patch
Cool. I'm glad that we now have a patch that applies cleanly against
master, while adding very little to buffile.c. It feels like we're
getting very close here.
>> * You didn't point out the randomAccess restriction in tuplesort.h.
>>
>
> I did, it's there in the file header comments.
I see what you wrote in tuplesort.h here:
> + * algorithm, and are typically only used for large amounts of data. Note
> + * that parallel sorts is not support for random access to the sort result.
This should say "...are not supported when random access is requested".
Done.
> Added handling for parallel_leader_participation as well as deleted
> compile time option force_single_worker.
I still see this:
> +
> +/*
> + * A parallel sort with one worker process, and without any leader-as-worker
> + * state may be used for testing the parallel tuplesort infrastructure.
> + */
> +#ifdef NOT_USED
> +#define FORCE_SINGLE_WORKER
> +#endif
Looks like you missed this FORCE_SINGLE_WORKER hunk -- please remove it, too.
Done.
>> The parallel_leader_participation docs will also need to be updated.
>>
>
> Done.
I don't see this. There is no reference to
parallel_leader_participation in the CREATE INDEX docs, nor is there a
reference to CREATE INDEX in the parallel_leader_participation docs.
I thought parallel_leader_participation is generic GUC which get effect
for all parallel operation. isn't it? On that understanding I just update the
documentation of parallel_leader_participation into config.sgml to
make it more generalize.
> Also performed more testing with the patch, with
> parallel_leader_participation
> ON and OFF. Found one issue, where earlier we always used to call
> _bt_leader_sort_as_worker() but now need to skip the call if
> parallel_leader_participation
> is OFF.
Hmm. I think the local variable within _bt_heapscan() should go back.
Its value should be directly taken from parallel_leader_participation
assignment, once. There might be some bizarre circumstances where it
is possible for the value of parallel_leader_participation to change
in flight, causing a race condition: we start with the leader as a
participant, and change our mind later within
_bt_leader_sort_as_worker(), causing the whole CREATE INDEX to hang
forever.
Even if that's impossible, it seems like an improvement in style to go
back to one local variable controlling everything.
Yes, to me also it's looks kind of impossible situation but then too
it make sense to make one local variable and then always read the
value from that.
Style issue here:
> + long start_block = file->numFiles * BUFFILE_SEG_SIZE;
> + int newNumFiles = file->numFiles + source->numFiles;
Shouldn't start_block conform to the surrounding camelCase style?
Done.
Finally, two new thoughts on the patch that are not responses to
anything you did in v14:
1. Thomas' barrier abstraction was added by commit 1145acc7. I think
that you should use a static barrier in tuplesort.c now, and rip out
the ConditionVariable fields in the Sharedsort struct. It's only a
slightly higher level of abstraction for tuplesort.c, which makes only
a small difference given the simple requirements of tuplesort.c.
However, I see no reason to not go that way if that's the new
standard, which it is. This looks like it will be fairly easy.
Pending, as per Thomas' explanation, it seems like some more work is
needed in the barrier APIs.
2. Does the plan_create_index_workers() cost model need to account for
parallel_leader_participation, too, when capping workers? I think that
it does.
The relevant planner code is:
> + /*
> + * Cap workers based on available maintenance_work_mem as needed.
> + *
> + * Note that each tuplesort participant receives an even share of the
> + * total maintenance_work_mem budget. Aim to leave workers (where
> + * leader-as-worker Tuplesortstate counts as a worker) with no less than
> + * 32MB of memory. This leaves cases where maintenance_work_mem is set to
> + * 64MB immediately past the threshold of being capable of launching a
> + * single parallel worker to sort.
> + */
> + sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
> + min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
> + while (parallel_workers > min_parallel_workers &&
> + sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
> + parallel_workers--;
This parallel CREATE INDEX planner code snippet is about the need to
have low per-worker maintenance_work_mem availability prevent more
parallel workers from being added to the number that we plan to
launch. Each worker tuplesort state needs at least 32MB. We clearly
need to do something here.
While it's always true that "leader-as-worker Tuplesortstate counts as
a worker" in v14, I think that it should only be true in the next
revision of the patch when parallel_leader_participation is actually
true (IOW, we should only add 1 to parallel_workers within the loop
invariant in that case). The reason why we need to consider
parallel_leader_participation within this plan_create_index_workers()
code is simple: During execution, _bt_leader_sort_as_worker() uses
"worker tuplesort states"/btshared->scantuplesortstates to determine
how much of a share of maintenance_work_mem each worker tuplesort
gets. Our planner code needs to take that into account, now that the
nbtsort.c parallel_leader_participation behavior isn't just some
obscure debug option. IOW, the planner code needs to be consistent
with the nbtsort.c execution code.
Ah nice catch. I passed the local variable (leaderasworker) of _bt_heapscan()
to plan_create_index_workers() rather than direct reading value from the
parallel_leader_participation (reasons are same as you explained earlier).
Thanks,
Rushabh Lathia
Attachment
Hello Rushabh, On Fri, December 8, 2017 2:28 am, Rushabh Lathia wrote: > Thanks for review. > > On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote: > >> On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia >> <rushabh.lathia@gmail.com> wrote: >> > 0001-Add-parallel-B-tree-index-build-sorting_v14.patch I've looked only at patch 0002, here are some comments. > + * leaderasworker indicates whether leader going to participate as worker or > + * not. The grammar is a bit off, and the "or not" seems obvious. IMHO this could be: + * leaderasworker indicates whether the leader is going to participate as worker The argument leaderasworker is only used once and for one temp. variable that is only used once, too. So the temp. variable could maybe go. And not sure what the verdict was from the const-discussion threads, I did not follow it through. If "const" is what should be done generally, then the argument could be consted, as to not create more "unconsted" code. E.g. so: +plan_create_index_workers(Oid tableOid, Oid indexOid, const bool leaderasworker) and later: - sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks) + sort_mem_blocks / (parallel_workers + (leaderasworker ? 1 : 0)) < min_sort_mem_blocks) Thank you for working on this patch! All the best, Tels
On Thu, Dec 7, 2017 at 11:28 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > I thought parallel_leader_participation is generic GUC which get effect > for all parallel operation. isn't it? On that understanding I just update > the > documentation of parallel_leader_participation into config.sgml to > make it more generalize. Okay. I'm not quite sure how to fit parallel_leader_participation into parallel CREATE INDEX (see my remarks on that below). I see a new bug in the patch (my own bug). Which is: the CONCURRENTLY case should obtain a RowExclusiveLock on the index relation within _bt_worker_main(), not an AccessExclusiveLock. That's all the leader has at that point within CREATE INDEX CONCURRENTLY. I now believe that index_create() should reject catalog parallel CREATE INDEX directly, just as it does for catalog CREATE INDEX CONCURRENTLY. That logic should be generic to all AMs, since the reasons for disallowing catalog parallel index builds are generic. On a similar note, *maybe* we should even call plan_create_index_workers() from index_create() (or at least some point within index.c). You're going to need a new field or two within IndexInfo for this, beside ii_Concurrent/ii_BrokenHotChain (next to the other stuff that is only used during index builds). Maybe ii_ParallelWorkers, and ii_LeaderAsWorker. What do you think of this suggestion? It's probably neater overall...though I'm less confident that this one is an improvement. Note that cluster.c calls plan_cluster_use_sort() directly, while checking "OldIndex->rd_rel->relam == BTREE_AM_OID" as a prerequisite to calling it. This seems like it might be considered an example that we should follow within index.c -- plan_create_index_workers() is based on plan_cluster_use_sort(). > Yes, to me also it's looks kind of impossible situation but then too > it make sense to make one local variable and then always read the > value from that. I think that it probably is technically possible, though the user would have to be doing something insane for it to be a problem. As I'm sure you understand, it's simpler to eliminate the possibility than it is to reason about it never happening. >> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think >> that you should use a static barrier in tuplesort.c now, and rip out >> the ConditionVariable fields in the Sharedsort struct. > Pending, as per Thomas' explanation, it seems like need some more > work in the barrier APIs. Okay. It's not the case that parallel tuplesort would significantly benefit from using the barrier abstraction, so I don't think we need to consider this a blocker to commit. My concern is mostly just that everyone is on the same page with barriers. > Ah nice catch. I passed the local variable (leaderasworker) of > _bt_heapscan() > to plan_create_index_workers() rather than direct reading value from the > parallel_leader_participation (reasons are same as you explained earlier). Cool. I don't think that this should be a separate patch -- please rebase + squash. Do you think that the main part of the cost model needs to care about parallel_leader_participation, too? compute_parallel_worker() assumes that the caller is planning a parallel-sequential-scan-alike thing, in the sense that the leader only acts like a worker in cases that probably don't have many workers, where the leader cannot keep itself busy as a leader. 
That's actually quite different to parallel CREATE INDEX, because the leader-as-worker state will behave in exactly the same way as a worker would, no matter how many workers there are. The leader process is guaranteed to give its full attention to being a worker, because it has precisely nothing else to do until workers finish. This makes me think that we may need to immediately do something with the result of compute_parallel_worker(), to consider whether or not a leader-as-worker state should be used, despite the fact that no existing compute_parallel_worker() caller does anything like this. -- Peter Geoghegan
Thanks Tels for reviewing the patch.
On Fri, Dec 8, 2017 at 2:54 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
Hello Rushabh,
On Fri, December 8, 2017 2:28 am, Rushabh Lathia wrote:
> Thanks for review.
>
> On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
>> <rushabh.lathia@gmail.com> wrote:
>> > 0001-Add-parallel-B-tree-index-build-sorting_v14.patch
I've looked only at patch 0002, here are some comments.
> + * leaderasworker indicates whether leader going to participate as
worker or
> + * not.
The grammar is a bit off, and the "or not" seems obvious. IMHO this could be:
+ * leaderasworker indicates whether the leader is going to participate as
worker
Sure.
The argument leaderasworker is only used once and for one temp. variable
that is only used once, too. So the temp. variable could maybe go.
And not sure what the verdict was from the const-discussion threads, I did
not follow it through. If "const" is what should be done generally, then
the argument could be consted, as to not create more "unconsted" code.
E.g. so:
+plan_create_index_workers(Oid tableOid, Oid indexOid, const bool
leaderasworker)
Makes sense.
and later:
- sort_mem_blocks / (parallel_workers + 1) <
min_sort_mem_blocks)
+ sort_mem_blocks / (parallel_workers + (leaderasworker
? 1 : 0)) < min_sort_mem_blocks)
Even I didn't like taking an extra variable, but then the code looks a bit
unreadable - so rather than making the code difficult to read - I thought
of adding a new variable.
Thank you for working on this patch!
I will address review comments in the next set of patches.
Regards,
Rushabh Lathia
On Sun, Dec 10, 2017 at 3:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Dec 7, 2017 at 11:28 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> I thought parallel_leader_participation is generic GUC which get effect
> for all parallel operation. isn't it? On that understanding I just update
> the
> documentation of parallel_leader_participation into config.sgml to
> make it more generalize.
Okay. I'm not quite sure how to fit parallel_leader_participation into
parallel CREATE INDEX (see my remarks on that below).
I see a new bug in the patch (my own bug). Which is: the CONCURRENTLY
case should obtain a RowExclusiveLock on the index relation within
_bt_worker_main(), not an AccessExclusiveLock. That's all the leader
has at that point within CREATE INDEX CONCURRENTLY.
Oh right. I also missed testing that earlier. Fixed now.
I now believe that index_create() should reject catalog parallel
CREATE INDEX directly, just as it does for catalog CREATE INDEX
CONCURRENTLY. That logic should be generic to all AMs, since the
reasons for disallowing catalog parallel index builds are generic.
Sorry I didn't get this, reject means? you mean it should throw an error
catalog parallel CREATE INDEX? or just suggesting to set the
ParallelWorkers and may be LeaderAsWorker from index_create()
or may be index_build()?
On a similar note, *maybe* we should even call
plan_create_index_workers() from index_create() (or at least some
point within index.c). You're going to need a new field or two within
IndexInfo for this, beside ii_Concurrent/ii_BrokenHotChain (next to
the other stuff that is only used during index builds). Maybe
ii_ParallelWorkers, and ii_LeaderAsWorker. What do you think of this
suggestion? It's probably neater overall...though I'm less confident
that this one is an improvement.
Note that cluster.c calls plan_cluster_use_sort() directly, while
checking "OldIndex->rd_rel->relam == BTREE_AM_OID" as a prerequisite
to calling it. This seems like it might be considered an example that
we should follow within index.c -- plan_create_index_workers() is
based on plan_cluster_use_sort().
> Yes, to me also it's looks kind of impossible situation but then too
> it make sense to make one local variable and then always read the
> value from that.
I think that it probably is technically possible, though the user
would have to be doing something insane for it to be a problem. As I'm
sure you understand, it's simpler to eliminate the possibility than it
is to reason about it never happening.
yes.
>> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think
>> that you should use a static barrier in tuplesort.c now, and rip out
>> the ConditionVariable fields in the Sharedsort struct.
> Pending, as per Thomas' explanation, it seems like some more work is
> needed in the barrier APIs.
Okay. It's not the case that parallel tuplesort would significantly
benefit from using the barrier abstraction, so I don't think we need
to consider this a blocker to commit. My concern is mostly just that
everyone is on the same page with barriers.
True, if needed, this can also be done later on.
> Ah nice catch. I passed the local variable (leaderasworker) of
> _bt_heapscan()
> to plan_create_index_workers() rather than direct reading value from the
> parallel_leader_participation (reasons are same as you explained earlier).
Cool. I don't think that this should be a separate patch -- please
rebase + squash.
Sure, done.
Do you think that the main part of the cost model needs to care about
parallel_leader_participation, too?
compute_parallel_worker() assumes that the caller is planning a
parallel-sequential-scan-alike thing, in the sense that the leader
only acts like a worker in cases that probably don't have many
workers, where the leader cannot keep itself busy as a leader. That's
actually quite different to parallel CREATE INDEX, because the
leader-as-worker state will behave in exactly the same way as a worker
would, no matter how many workers there are. The leader process is
guaranteed to give its full attention to being a worker, because it
has precisely nothing else to do until workers finish. This makes me
think that we may need to immediately do something with the result of
compute_parallel_worker(), to consider whether or not a
leader-as-worker state should be used, despite the fact that no
existing compute_parallel_worker() caller does anything like this.
I agree with you. compute_parallel_worker() is mainly designed for the
scan-alike things. Whereas parallel create index is different in a
sense where leader has as much power as worker. But at the same
time I don't see any side effect or negative of that with PARALLEL
CREATE INDEX. So I am more towards not changing that, at least
for now - as part of this patch.
Thanks for review.
Regards, Rushabh Lathia
Attachment
On Tue, Dec 12, 2017 at 2:09 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: >> I now believe that index_create() should reject catalog parallel >> CREATE INDEX directly, just as it does for catalog CREATE INDEX >> CONCURRENTLY. That logic should be generic to all AMs, since the >> reasons for disallowing catalog parallel index builds are generic. >> > > Sorry I didn't get this, reject means? you mean it should throw an error > catalog parallel CREATE INDEX? or just suggesting to set the > ParallelWorkers and may be LeaderAsWorker from index_create() > or may be index_build()? I mean that we should be careful to make sure that AM-generic parallel CREATE INDEX logic does not end up in a specific AM (nbtree). The patch *already* refuses to perform a parallel CREATE INDEX on a system catalog, which is what I meant by reject (sorry for being unclear). The point is that that's due to a restriction that has nothing to do with nbtree in particular (just like the CIC restriction on catalogs), so it should be performed within index_build(). Just like the similar CONCURRENTLY-on-a-catalog restriction, though without throwing an error, since of course the user doesn't explicitly ask for a parallel CREATE INDEX at any point (unlike CONCURRENTLY). Once we go this way, the cost model has to be called at that point, too. We already have the AM-specific "OldIndex->rd_rel->relam == BTREE_AM_OID" tests within cluster.c, even though theoretically another AM might be involved with CLUSTER in the future, which this seems similar to. So, I propose the following (this is a rough outline): * Add new IndexInfo fields after ii_Concurrent/ii_BrokenHotChain -- ii_ParallelWorkers and ii_LeaderAsWorker. * Call plan_create_index_workers() within index_create(), assigning to ii_ParallelWorkers, and fill in ii_LeaderAsWorker from the parallel_leader_participation GUC. Add comments along the lines of "only nbtree supports parallel builds". Test the index with a "heapRelation->rd_rel->relam == BTREE_AM_OID" to make this work. Otherwise, assign zero to ii_ParallelWorkers (and leave ii_LeaderAsWorker as false). * For builds on catalogs, or builds using other AMs, don't let parallelism go ahead by immediately assigning zero to ii_ParallelWorkers within index_create(), near where the similar CIC test occurs already. What do you think of that? >> Do you think that the main part of the cost model needs to care about >> parallel_leader_participation, too? >> >> compute_parallel_worker() assumes that the caller is planning a >> parallel-sequential-scan-alike thing, in the sense that the leader >> only acts like a worker in cases that probably don't have many >> workers, where the leader cannot keep itself busy as a leader. That's >> actually quite different to parallel CREATE INDEX, because the >> leader-as-worker state will behave in exactly the same way as a worker >> would, no matter how many workers there are. The leader process is >> guaranteed to give its full attention to being a worker, because it >> has precisely nothing else to do until workers finish. This makes me >> think that we may need to immediately do something with the result of >> compute_parallel_worker(), to consider whether or not a >> leader-as-worker state should be used, despite the fact that no >> existing compute_parallel_worker() caller does anything like this. >> > > I agree with you. compute_parallel_worker() is mainly designed for the > scan-alike things.
Whereas parallel create index is different in a > sense where leader has as much power as worker. But at the same > time I don't see any side effect or negative of that with PARALLEL > CREATE INDEX. So I am more towards not changing that, at least > for now - as part of this patch. I've also noticed that there is little to no negative effect on CREATE INDEX duration from adding new workers past the point where adding more workers stops making the build faster. It's quite clear. And, in general, there isn't all that much theoretical justification for the cost model (it's essentially the same as any other parallel scan), which doesn't seem to matter much. So, I agree that it doesn't really matter in practice, but disagree that it should not still be changed -- the justification may be a little thin, but I think that we need to stick to it. There should be a theoretical justification for the cost model that is coherent in the wider context of cost models for parallelism in general. It should not be arbitrarily inconsistent just because it apparently doesn't matter that much. It's easy to fix -- let's just fix it. -- Peter Geoghegan
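As a standalone sketch, that outline might look roughly like the following. All of the types and helpers here are stubs invented for illustration; only the proposed field names and the decision logic are taken from the outline above:

#include <stdio.h>
#include <stdbool.h>

typedef unsigned int Oid;

#define BTREE_AM_OID 403    /* nbtree's actual pg_am OID */

typedef struct IndexInfo
{
    bool ii_Concurrent;
    bool ii_BrokenHotChain;
    int  ii_ParallelWorkers;    /* proposed new field */
    bool ii_LeaderAsWorker;     /* proposed new field */
} IndexInfo;

/* stubs standing in for the real planner call and GUC */
static bool parallel_leader_participation = true;

static int
plan_create_index_workers(Oid tableOid, Oid indexOid)
{
    return 4;    /* pretend the cost model chose 4 workers */
}

static void
set_index_build_parallelism(IndexInfo *ii, Oid tableOid, Oid indexOid,
                            Oid relam, bool is_catalog)
{
    if (is_catalog || relam != BTREE_AM_OID)
    {
        /* only nbtree supports parallel builds; catalogs never do */
        ii->ii_ParallelWorkers = 0;
        ii->ii_LeaderAsWorker = false;
    }
    else
    {
        ii->ii_ParallelWorkers = plan_create_index_workers(tableOid, indexOid);
        ii->ii_LeaderAsWorker = parallel_leader_participation;
    }
}

int
main(void)
{
    IndexInfo ii = {false, false, 0, false};

    set_index_build_parallelism(&ii, 16384, 16385, BTREE_AM_OID, false);
    printf("workers=%d leader-as-worker=%d\n",
           ii.ii_ParallelWorkers, (int) ii.ii_LeaderAsWorker);
    return 0;
}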
On Sun, Dec 31, 2017 at 9:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Dec 12, 2017 at 2:09 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
>> I now believe that index_create() should reject catalog parallel
>> CREATE INDEX directly, just as it does for catalog CREATE INDEX
>> CONCURRENTLY. That logic should be generic to all AMs, since the
>> reasons for disallowing catalog parallel index builds are generic.
>>
>
> Sorry I didn't get this, reject means? you mean it should throw an error
> catalog parallel CREATE INDEX? or just suggesting to set the
> ParallelWorkers and may be LeaderAsWorker from index_create()
> or may be index_build()?
I mean that we should be careful to make sure that AM-generic parallel
CREATE INDEX logic does not end up in a specific AM (nbtree).
Ah okay, that's what I thought.
The patch *already* refuses to perform a parallel CREATE INDEX on a
system catalog, which is what I meant by reject (sorry for being
unclear). The point is that that's due to a restriction that has
nothing to do with nbtree in particular (just like the CIC restriction
on catalogs), so it should be performed within index_build(). Just
like the similar CONCURRENTLY-on-a-catalog restriction, though without
throwing an error, since of course the user doesn't explicitly ask for
a parallel CREATE INDEX at any point (unlike CONCURRENTLY).
Once we go this way, the cost model has to be called at that point,
too. We already have the AM-specific "OldIndex->rd_rel->relam ==
BTREE_AM_OID" tests within cluster.c, even though theoretically
another AM might be involved with CLUSTER in the future, which this
seems similar to.
So, I propose the following (this is a rough outline):
* Add new IndexInfo fields after ii_Concurrent/ii_BrokenHotChain --
ii_ParallelWorkers and ii_LeaderAsWorker.
* Call plan_create_index_workers() within index_create(), assigning to
ii_ParallelWorkers, and fill in ii_LeaderAsWorker from the
parallel_leader_participation GUC. Add comments along the lines of
"only nbtree supports parallel builds". Test the index with a
"heapRelation->rd_rel->relam == BTREE_AM_OID" to make this work.
Otherwise, assign zero to ii_ParallelWorkers (and leave
ii_LeaderAsWorker as false).
* For builds on catalogs, or builds using other AMs, don't let
parallelism go ahead by immediately assigning zero to
ii_ParallelWorkers within index_create(), near where the similar CIC
test occurs already.
What do you think of that?
Need to do this after the indexRelation build. So I added it after the
update of pg_index, as indexRelation is needed for plan_create_index_workers().
Attaching the separate patch for the same.
>> Do you think that the main part of the cost model needs to care about
>> parallel_leader_participation, too?
>>
>> compute_parallel_worker() assumes that the caller is planning a
>> parallel-sequential-scan-alike thing, in the sense that the leader
>> only acts like a worker in cases that probably don't have many
>> workers, where the leader cannot keep itself busy as a leader. That's
>> actually quite different to parallel CREATE INDEX, because the
>> leader-as-worker state will behave in exactly the same way as a worker
>> would, no matter how many workers there are. The leader process is
>> guaranteed to give its full attention to being a worker, because it
>> has precisely nothing else to do until workers finish. This makes me
>> think that we may need to immediately do something with the result of
>> compute_parallel_worker(), to consider whether or not a
>> leader-as-worker state should be used, despite the fact that no
>> existing compute_parallel_worker() caller does anything like this.
>>
>
> I agree with you. compute_parallel_worker() is mainly designed for the
> scan-alike things. Whereas parallel create index is different in a
> sense where leader has as much power as worker. But at the same
> time I don't see any side effect or negative of that with PARALLEL
> CREATE INDEX. So I am more towards not changing that, at least
> for now - as part of this patch.
I've also noticed that there is little to no negative effect on
CREATE INDEX duration from adding new workers past the point where
adding more workers stops making the build faster. It's quite clear.
And, in general, there isn't all that much theoretical justification
for the cost model (it's essentially the same as any other parallel
scan), which doesn't seem to matter much. So, I agree that it doesn't
really matter in practice, but disagree that it should not still be
changed -- the justification may be a little thin, but I think that we
need to stick to it. There should be a theoretical justification for
the cost model that is coherent in the wider context of cost models
for parallelism in general. It should not be arbitrarily inconsistent
just because it apparently doesn't matter that much. It's easy to fix
-- let's just fix it.
So you are suggesting that we need to adjust the output of
compute_parallel_worker() by considering parallel_leader_participation?
Thanks,
Rushabh Lathia
Attachment
On Tue, Jan 2, 2018 at 1:38 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > Need to do this after the indexRelation build. So I added it after the > update of pg_index, as indexRelation is needed for > plan_create_index_workers(). > > Attaching the separate patch for the same. This made it so that REINDEX and CREATE INDEX CONCURRENTLY no longer used parallelism. I think we need to do this very late, just before nbtree's ambuild() routine is called from index.c. > So you are suggesting that we need to adjust the output of > compute_parallel_worker() by considering parallel_leader_participation? We know for sure that there is no reason to not use the leader process as a worker process in the case of parallel CREATE INDEX. So we must not have the number of participants (i.e. worker Tuplesortstates) vary based on the current parallel_leader_participation setting. While parallel_leader_participation can affect the number of worker processes requested, that's a different thing. There is no question about parallel_leader_participation ever being relevant to performance -- it's strictly a testing option for us. Even after parallel_leader_participation was added, compute_parallel_worker() still assumes that the sequential scan leader is always too busy to help. compute_parallel_worker() seems to think that that's something that the leader does in "rare" cases not worth considering -- cases where it has no worker tuples to consume (maybe I'm reading too much into it not caring about parallel_leader_participation, but I don't think so). If compute_parallel_worker()'s assumption was questionable before, it's completely wrong for parallel CREATE INDEX. I think plan_create_index_workers() needs to count the leader-as-worker as an ordinary worker, not special in any way by deducting one worker from what compute_parallel_worker() returns. (This only happens when it's necessary to compensate -- when leader-as-worker participation is going to go ahead.) I'm working on fixing up what you posted. I'm probably not more than a week away from posting a patch that I'm going to mark "ready for committer". I've already made the change above, and once I spend time on trying to break the few small changes needed within buffile.c I'll have taken it as far as I can, most likely. -- Peter Geoghegan
On Wed, Jan 3, 2018 at 9:11 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Jan 2, 2018 at 1:38 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> Need to do this after the indexRelation build. So I added it after the
> update of pg_index, as indexRelation is needed for
> plan_create_index_workers().
>
> Attaching the separate patch for the same.
This made it so that REINDEX and CREATE INDEX CONCURRENTLY no longer
used parallelism. I think we need to do this very late, just before
nbtree's ambuild() routine is called from index.c.
Ahh right. We should move the plan_create_index_workers() call to
index_build() before the ambuild().
> So you are suggesting that we need to adjust the output of
> compute_parallel_worker() by considering parallel_leader_participation?
We know for sure that there is no reason to not use the leader process
as a worker process in the case of parallel CREATE INDEX. So we must
not have the number of participants (i.e. worker Tuplesortstates) vary
based on the current parallel_leader_participation setting. While
parallel_leader_participation can affect the number of worker
processes requested, that's a different thing. There is no question
about parallel_leader_participation ever being relevant to performance
-- it's strictly a testing option for us.
Even after parallel_leader_participation was added,
compute_parallel_worker() still assumes that the sequential scan
leader is always too busy to help. compute_parallel_worker() seems to
think that that's something that the leader does in "rare" cases not
worth considering -- cases where it has no worker tuples to consume
(maybe I'm reading too much into it not caring about
parallel_leader_participation, but I don't think so). If
compute_parallel_worker()'s assumption was questionable before, it's
completely wrong for parallel CREATE INDEX. I think
plan_create_index_workers() needs to count the leader-as-worker as an
ordinary worker, not special in any way by deducting one worker from
what compute_parallel_worker() returns. (This only happens when it's
necessary to compensate -- when leader-as-worker participation is
going to go ahead.)
Yes, even with parallel_leader_participation - compute_parallel_worker()
doesn't take that into consideration. Or maybe the assumption is to
launch the number of workers returned by compute_parallel_worker(),
irrespective of whether the leader is going to participate in a scan or not.
I agree that plan_create_index_workers() needs to count the leader as a
normal worker for the CREATE INDEX. So what you proposing is - when
parallel_leader_participation is true launch (return value of compute_parallel_worker() - 1)
workers. true ?
I'm working on fixing up what you posted. I'm probably not more than a
week away from posting a patch that I'm going to mark "ready for
committer". I've already made the change above, and once I spend time
on trying to break the few small changes needed within buffile.c I'll
have taken it as far as I can, most likely.
Okay, once you submit the patch with changes - I will do one round of
review for the changes.
Thanks,
Rushabh Lathia
On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > I agree that plan_create_index_workers() needs to count the leader as a > normal worker for the CREATE INDEX. So what you proposing is - when > parallel_leader_participation is true launch (return value of > compute_parallel_worker() - 1) > workers. true ? Almost. We need to not subtract one when only one worker is indicated by compute_parallel_worker(). I also added some new stuff there, to consider edge cases with the parallel_leader_participation GUC. >> I'm working on fixing up what you posted. I'm probably not more than a >> week away from posting a patch that I'm going to mark "ready for >> committer". I've already made the change above, and once I spend time >> on trying to break the few small changes needed within buffile.c I'll >> have taken it as far as I can, most likely. >> > > Okay, once you submit the patch with changes - I will do one round of > review for the changes. I've attached my revision. Changes include: * Changes to plan_create_index_workers() were made along the lines recently discussed. * plan_create_index_workers() is now called right before the ambuild routine is called (nbtree index builds only, of course). * Significant overhaul of tuplesort.h contract. This had references to the old approach, and to tqueue.c's tuple descriptor thing that was since superseded by the typmod registry added for parallel hash join. These were updated/removed. * Both tuplesort.c and logtape.c now say that they cannot write to the writable/last tape, while still acknowledging that it is in fact the leader tape, and that this restriction is due to a restriction with BufFiles. They also point out that if the restriction within buffile.c ever was removed, everything would work fine. * Added new call to BufFileExportShared() when freezing tape in logtape.c. * Tweaks to documentation. * pgindent ran on modified files. * Polished the stuff that is added to buffile.c. Mostly comments that clarify its reason for existing. Also added Assert()s. Note that I added Heikki as an author in the commit message. Technically, Heikki didn't actually write code for parallel CREATE INDEX, but he did loads of independently useful work on merging + temp file I/O that went into Postgres 10 (though this wasn't listed in the v10 release notes). That work was done in large part to help the parallel CREATE INDEX patch, and it did in fact help it quite noticeably, so I think that this is warranted. Remember that with parallel CREATE INDEX, the leader's merge occurs serially, so anything that we can do to speed that part up is very helpful. This revision does seem very close, but I'll hold off on changing the status of the patch for a few more days, to give you time to give some feedback. -- Peter Geoghegan
Attachment
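A minimal standalone sketch of the adjustment being described (the function is an illustrative stand-in, not the patch's actual plan_create_index_workers() logic):

#include <stdio.h>
#include <stdbool.h>

static int
adjust_for_leader_participation(int compute_result, bool leader_participates)
{
    /*
     * With a participating leader, N - 1 worker processes plus the leader
     * still give us N sorting participants; but never go below one actual
     * worker, or no parallel workers would be launched at all.
     */
    if (leader_participates && compute_result > 1)
        return compute_result - 1;
    return compute_result;
}

int
main(void)
{
    printf("%d\n", adjust_for_leader_participation(4, true));    /* 3 */
    printf("%d\n", adjust_for_leader_participation(1, true));    /* 1 */
    printf("%d\n", adjust_for_leader_participation(4, false));   /* 4 */
    return 0;
}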
On Sat, Jan 6, 2018 at 3:47 AM, Peter Geoghegan <pg@bowt.ie> wrote:
Thanks,
On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
> I agree that plan_create_index_workers() needs to count the leader as a
> normal worker for the CREATE INDEX. So what you proposing is - when
> parallel_leader_participation is true launch (return value of
> compute_parallel_worker() - 1)
> workers. true ?
Almost. We need to not subtract one when only one worker is indicated
by compute_parallel_worker(). I also added some new stuff there, to
consider edge cases with the parallel_leader_participation GUC.
>> I'm working on fixing up what you posted. I'm probably not more than a
>> week away from posting a patch that I'm going to mark "ready for
>> committer". I've already made the change above, and once I spend time
>> on trying to break the few small changes needed within buffile.c I'll
>> have taken it as far as I can, most likely.
>>
>
> Okay, once you submit the patch with changes - I will do one round of
> review for the changes.
I've attached my revision. Changes include:
* Changes to plan_create_index_workers() were made along the lines
recently discussed.
* plan_create_index_workers() is now called right before the ambuild
routine is called (nbtree index builds only, of course).
* Significant overhaul of tuplesort.h contract. This had references to
the old approach, and to tqueue.c's tuple descriptor thing that was
since superseded by the typmod registry added for parallel hash join.
These were updated/removed.
* Both tuplesort.c and logtape.c now say that they cannot write to the
writable/last tape, while still acknowledging that it is in fact the
leader tape, and that this restriction is due to a restriction with
BufFiles. They also point out that if the restriction within buffile.c
ever was removed, everything would work fine.
* Added new call to BufFileExportShared() when freezing tape in logtape.c.
* Tweaks to documentation.
* pgindent ran on modified files.
* Polished the stuff that is added to buffile.c. Mostly comments that
clarify its reason for existing. Also added Assert()s.
Note that I added Heikki as an author in the commit message.
Technically, Heikki didn't actually write code for parallel CREATE
INDEX, but he did loads of independently useful work on merging + temp
file I/O that went into Postgres 10 (though this wasn't listed in the
v10 release notes). That work was done in large part to help the
parallel CREATE INDEX patch, and it did in fact help it quite
noticeably, so I think that this is warranted. Remember that with
parallel CREATE INDEX, the leader's merge occurs serially, so anything
that we can do to speed that part up is very helpful.
This revision does seem very close, but I'll hold off on changing the
status of the patch for a few more days, to give you time to give some
feedback.
Thanks Peter for the updated patch.
I went through the changes and performed basic testing. The changes
look good and I haven't found anything unusual during testing.
Rushabh Lathia
On Mon, Jan 8, 2018 at 9:44 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > I went through the changes and performed basic testing. The changes > look good and I haven't found anything unusual during testing. Then I'll mark the patch "Ready for Committer" now. I think that we've done just about all we can with it. There is one lingering concern that I cannot shake, which stems from the fact that the cost model (plan_create_index_workers()) follows the same generic logic for adding workers as parallel sequential scan, per Robert's feedback from around March of last year (that is, we more or less just reuse compute_parallel_worker()). My specific concern is that this approach may be too aggressive in situations where a parallel external sort ends up being used instead of a serial internal sort. No weight is given to any extra temp file costs; a serial external sort is, in a sense, the baseline, including in cases where the table is very small and an external sort can actually easily be avoided iff we do a serial sort. This is probably not worth doing anything about. The distinction between internal and external sorts became rather blurred in 9.6 and 10, which, in a way, this patch builds on. If what I describe is a problem at all, it will very probably only be a problem on small CREATE INDEX operations, where linear sequential I/O costs are not already dwarfed by the linearithmic CPU costs. (The dominance of CPU/comparison costs on larger sorts is the main reason why external sorts can be faster than internal sorts -- this happens fairly frequently these days, especially with CREATE INDEX, where being able to write out the index as it merges on-the-fly helps a lot.) -- Peter Geoghegan
On Sat, Jan 6, 2018 at 11:17 AM, Peter Geoghegan <pg@bowt.ie> wrote: > * Significant overhaul of tuplesort.h contract. This had references to > the old approach, and to tqueue.c's tuple descriptor thing that was > since superseded by the typmod registry added for parallel hash join. > These were updated/removed. +1 > * Both tuplesort.c and logtape.c now say that they cannot write to the > writable/last tape, while still acknowledging that it is in fact the > leader tape, and that this restriction is due to a restriction with > BufFiles. They also point out that if the restriction within buffile.c > ever was removed, everything would work fine. +1 > * Added new call to BufFileExportShared() when freezing tape in logtape.c. +1 > * Polished the stuff that is added to buffile.c. Mostly comments that > clarify its reason for existing. Also added Assert()s. +1 This looks good to me. -- Thomas Munro http://www.enterprisedb.com
On Tue, Jan 9, 2018 at 10:36 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > This looks good to me. The addition to README.parallel is basically wrong, because workers have been allowed to write WAL since the ParallelContext machinery. See the XactLastRecEnd handling in parallel.c. Workers can, for example, do HOT cleanups during SELECT scans, just as the leader can. The language here is obsolete anyway in light of commit e9baa5e9fa147e00a2466ab2c40eb99c8a700824, but this isn't the right way to update it. I'll propose a separate patch for that. The change to the ParallelContext signature in parallel.h makes an already-overlength line even longer. A line break seems warranted just after the first argument, plus pgindent afterward. I am not a fan of the leader-as-worker terminology. The leader is not a worker, full stop. I think we should instead talk about whether the leader participates (so, ii_LeaderAsWorker -> ii_LeaderParticipates, for example, plus many comment updates). Similarly, it seems SortCoordinateData's nLaunched should be nParticipants, and BTLeader's nworkertuplesorts should be nparticipanttuplesorts. There is also the question of whether we want to respect parallel_leader_participation in this context. The issues which might motivate the desire for such behavior in the context of a query do not exist when creating a btree index, so maybe we're just making this complicated. On the other hand, if some other type of parallel index build does end up doing a Gather-like operation then we might regret deciding that parallel_leader_participation doesn't apply to index builds, so maybe it's OK the way we have it. On the third hand, the complexity of having the leader maybe-participate seems like it extends to a fair number of places in the code, and getting rid of all that complexity seems appealing. One place where this actually causes a problem is the message changes to index_build(). The revised ereport() violates translatability guidelines, which require that messages not be assembled from pieces. See https://www.postgresql.org/docs/devel/static/nls-programmer.html#NLS-GUIDELINES A comment added to tuplesort.h says that work_mem should be at least 64KB, but does not give any reason. I think one should be given, at least briefly, so that someone looking at these comments in the future can, for example, figure out whether the comment is still correct after future code changes. Or else, remove the comment. + * Parallel sort callers are required to coordinate multiple tuplesort states + * in a leader process, and one or more worker processes. The leader process I think the comma should be removed. As written, it looks like we are coordinating multiple tuplesort states in a leader process, and, separately, we are coordinating one or more worker processes. But in fact we are coordinating multiple tuplesort states which are in a group of processes that includes the leader and one or more worker processes. Generally, I think the comments in tuplesort.h are excellent. I really like the overview of how the new interfaces should be used, although I find it slightly wonky that the leader needs two separate Tuplesortstates if it wants to participate. I don't understand why this patch needs to tinker with the tests in vacuum.sql. The comments say that "If we did not do this, errors raised would concern running ANALYZE in parallel mode." However, why should parallel CREATE INDEX have any impact on ANALYZE at all?
Also, as a practical matter, if I revert those changes, 'make check' still passes with or without force_parallel_mode=on. I really dislike the fact that this patch invents another thing for force_parallel_mode to do. I invented force_parallel_mode mostly as a way of testing that functions were correctly labeled for parallel-safety, and I think it would be just fine if it never does anything else. As it is, it already does two quite separate things to accomplish that goal: (1) forcibly run the whole plan with parallel mode restrictions enabled, provided that the plan is not parallel-unsafe, and (2) runs the plan in a worker, provided that the plan is parallel-safe. There's a subtle difference between those two conditions, which is that not parallel-unsafe does not equal parallel-safe; there is also parallel-restricted. The fact that force_parallel_mode controls two different behaviors has, I think, already caused some confusion for prominent PostgreSQL developers and, likely, users as well. Making it do a third thing seems to me to be adding to the confusion, and not only because there are no documentation changes to match. If we go down this road, there will probably be more additions -- what happens when parallel VACUUM arrives, or parallel CLUSTER, or whatever? I don't think it will be a good thing for PostgreSQL if we end up with force_parallel_mode=on as a general "use parallelism even though it's stupid" flag, requiring supporting code in many different places throughout the code base and a laundry list of not-actually-useful behavior changes in the documentation. What I think would be a lot more useful, and what I sort of expected the patch to have, is a way for a user to explicitly control the number of workers requested for a CREATE INDEX operation. We all know that the cost model is crude and that may be OK -- though it would be interesting to see some research on what the run times actually look like for various numbers of workers at various table sizes and work_mem settings -- but it will be inconvenient for DBAs who actually know what number of workers they want to use to instead get whatever value plan_create_index_workers() decides to emit. They can force it by setting the parallel_workers reloption, but that affects queries. They can probably also do it by setting min_parallel_table_scan_size = 0 and max_parallel_workers_maintenance to whatever value they want, but I think it would be convenient for there to be a more straightforward way to do it, or at least some documentation in the CREATE INDEX page about how to get the number of workers you really want. To be clear, I don't think that this is a must-fix issue for this patch to get committed, but I do think that all reference to force_parallel_mode=on should go away. I do not like the way that this patch wants to turn the section of the documentation on when parallel query can be used into a discussion of when parallelism can be used. I think it would be better to leave that section alone and instead document under CREATE INDEX the concerns specific to parallel index build. I think this will be easier for users to understand and far easier to maintain as the number of parallel DDL operations increases, which I expect it to do somewhat explosively. The patch as written says things like "If a utility statement that is expected to do so does not produce a parallel plan, ..."
but, one, utility statements *do not produce plans of any type* and, two, the concerns here are really specific to parallel CREATE INDEX and there is every reason to think that they might be different in other cases. I feel strongly that it's enough for this section to try to explain the concerns that pertain to optimizable queries and leave utility commands to be treated elsewhere. If we find that we're accumulating a lot of documentation for various parallel utility commands that seems to be duplicative, we can write a general treatment of that topic that is separate from this one. The documentation for max_parallel_workers_maintenance cribs from the documentation for max_parallel_workers_per_gather in saying that we'll use fewer workers than expected "which may be inefficient". However, for parallel CREATE INDEX, that trailing clause is, at least as far as I can see, not applicable. For a query, we might choose a Gather over a Parallel Seq Scan because we think we've got a lot of workers; with only one participant, we might prefer a GIN index scan. If it turns out we don't get the workers, we've got a clearly suboptimal plan. For CREATE INDEX, though, it seems to me that we don't make any decisions based on the number of workers we think we'll have. If we get fewer workers, it may be slower, but it should still be as fast as it can be with that number of workers, which for queries is not the case. + * These fields are not modified throughout the sort. They primarily + * exist for the benefit of worker processes, that need to create BTSpool + * state corresponding to that used by the leader. throughout -> during remove comma + * builds, that must work just the same when an index is built in remove comma + * State that is aggregated by workers, to report back to leader. State that is maintained by workers and reported back to leader. + * indtuples is the total number of tuples that made it into index. into the index + * btleader is only present when a parallel index build is performed, and + * only in leader process (actually, only the leader has a BTBuildState. + * Workers have their own spool and spool2, though.) the leader process period after "process" capitalize actually + * Done. Leave a way for leader to determine we're finished. Record how + * many tuples were in this worker's share of the relation. I don't understand what the "Leave a way" comment means. + * To support parallel sort operations involving coordinated callers to + * tuplesort.c routines across multiple workers, it is necessary to + * concatenate each worker BufFile/tapeset into one single leader-wise + * logical tapeset. Workers should have produced one final materialized + * tape (their entire output) when this happens in leader; there will always + * be the same number of runs as input tapes, and the same number of input + * tapes as workers. I can't interpret the word "leader-wise". A partition-wise join is a join done one partition at a time, but a leader-wise logical tape set is not done one leader at a time. If there's another meaning to the affix -wise, I'm not familiar with it. Don't we just mean "a single logical tapeset managed by the leader"? There's a lot here I haven't grokked yet, but I'm running out of mental energy so I think I'll send this for now and work on this some more when time permits, hopefully tomorrow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
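To illustrate the translatability point with a standalone example: the messages below and the local _() macro are invented for illustration (in real backend code _() marks strings for gettext-style lookup), and are not the patch's actual wording:

#include <stdio.h>
#include <stdbool.h>

#define _(x) (x)    /* stand-in for translatable-message markup */

static void
report_build(bool parallel, int nworkers)
{
    /* WRONG: assembles a sentence from pieces; no whole unit to translate */
    printf("building index%s\n", parallel ? _(" in parallel") : "");

    /* RIGHT: each complete message is its own translatable string */
    if (parallel)
        printf(_("building index with %d parallel workers\n"), nworkers);
    else
        printf(_("building index serially\n"));
}

int
main(void)
{
    report_build(true, 2);
    report_build(false, 0);
    return 0;
}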
> On Jan 10, 2018, at 21:45, Robert Haas <robertmhaas@gmail.com> wrote: > > The documentation for max_parallel_workers_maintenance cribs from the > documentation for max_parallel_workers_per_gather in saying that we'll > use fewer workers than expected "which may be inefficient". Can we actually call it max_parallel_maintenance_workers instead? I mean we don't have work_mem_maintenance.
On Wed, Jan 10, 2018 at 3:29 PM, Evgeniy Shishkin <itparanoia@gmail.com> wrote: >> On Jan 10, 2018, at 21:45, Robert Haas <robertmhaas@gmail.com> wrote: >> The documentation for max_parallel_workers_maintenance cribs from the >> documentation for max_parallel_workers_per_gather in saying that we'll >> use fewer workers than expected "which may be inefficient". > > Can we actually call it max_parallel_maintenance_workers instead? > I mean we don't have work_mem_maintenance. Good point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 10, 2018 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote: > The addition to README.parallel is basically wrong, because workers > have been allowed to write WAL since the ParallelContext machinery. > See the > XactLastRecEnd handling in parallel.c. Workers can, for example, do > HOT cleanups during SELECT scans, just as the leader can. The > language here is obsolete anyway in light of commit > e9baa5e9fa147e00a2466ab2c40eb99c8a700824, but this isn't the right way > to update it. I'll propose a separate patch for that. WFM. > The change to the ParallelContext signature in parallel.h makes an > already-overlength line even longer. A line break seems warranted > just after the first argument, plus pgindent afterward. Okay. > I am not a fan of the leader-as-worker terminology. The leader is not > a worker, full stop. I think we should instead talk about whether the > leader participates (so, ii_LeaderAsWorker -> ii_LeaderParticipates, > for example, plus many comment updates). Similarly, it seems > SortCoordinateData's nLaunched should be nParticipants, and BTLeader's > nworkertuplesorts should be nparticipanttuplesorts. Okay. > There is also the question of whether we want to respect > parallel_leader_participation in this context. The issues which might > motivate the desire for such behavior in the context of a query do not > exist when creating a btree index, so maybe we're just making this > complicated. On the other hand, if some other type of parallel index > build does end up doing a Gather-like operation then we might regret > deciding that parallel_leader_participation doesn't apply to index > builds, so maybe it's OK the way we have it. On the third hand, the > complexity of having the leader maybe-participate seems like it > extends to a fair number of places in the code, and getting rid of all > that complexity seems appealing. I only added support for the leader-as-worker case because I assumed that it was important to have CREATE INDEX process allocation work "analogously" to parallel query, even though it's clear that the two situations are not really completely comparable when you dig deep enough. Getting rid of the leader participating as a worker has theoretical downsides, but real practical upsides. I am also tempted to just get rid of it. > One place where this actually causes a problem is the message changes > to index_build(). The revised ereport() violates translatability > guidelines, which require that messages not be assembled from pieces. > See https://www.postgresql.org/docs/devel/static/nls-programmer.html#NLS-GUIDELINES Noted. Another place where a worker Tuplesortstate in the leader process causes problems is plan_create_index_workers(), especially because of things like force_parallel_mode and parallel_leader_participation. > A comment added to tuplesort.h says that work_mem should be at least > 64KB, but does not give any reason. I think one should be given, at > least briefly, so that someone looking at these comments in the future > can, for example, figure out whether the comment is still correct > after future code changes. Or else, remove the comment. The reason for needing to do this is that a naive division of work_mem/maintenance_work_mem within a caller like nbtsort.c could, in general, result in a workMem that is as low as 0 (due to integer truncation of the result of a division). Clearly *that* is too low.
In fact, we need at least enough memory to store the initial minimal memtuples array, which needs to respect ALLOCSET_SEPARATE_THRESHOLD. There is also the matter of having per-tape space for TAPE_BUFFER_OVERHEAD when we spill to disk (note also the special case for pass-by-value datum sorts low on memory). There have been a couple of unavoidable OOM bugs in tuplesort over the years already. How about I remove the comment, but have tuplesort_begin_common() force each Tuplesortstate to have workMem that is at least 64KB (the minimum legal work_mem value) in all cases? We can just formalize the existing assumption that workMem cannot go below 64KB, really, and it isn't reasonable to use so little workMem within a parallel worker (it should be prevented by plan_create_index_workers() in the real world, where parallelism is never artificially forced). There is no need to make this complicated by worrying about whether or not 64KB is the true minimum (the value that avoids "can't happen" errors), IMV. > + * Parallel sort callers are required to coordinate multiple tuplesort states > + * in a leader process, and one or more worker processes. The leader process > > I think the comma should be removed. As written, it looks like we > are coordinating multiple tuplesort states in a leader process, and, > separately, we are coordinating one or more worker processes. Okay. > Generally, I think the comments in tuplesort.h are excellent. Thanks. > I really like the overview of how the new interfaces should be used, > although I find it slightly wonky that the leader needs two separate > Tuplesortstates if it wants to participate. Assuming that we end up actually allowing the leader to participate as a worker at all, I think that having that be a separate Tuplesortstate is better than the alternative. There are a couple of places where I can see it mattering. For one thing, dtrace-compatible traces become more complicated -- LogicalTapeSetBlocks() is reported to dtrace within workers (though not via trace_sort logging, where it is considered redundant). For another, I think we'd need to have multiple tapesets at the same time for the leader if it only had one Tuplesortstate, which means multiple new Tuplesortstate fields. In short, having a distinct Tuplesortstate means almost no special cases. Maybe you find it slightly wonky because parallel CREATE INDEX really does have the leader participate as a worker with minimal caveats. It will do just as much work as a real parallel worker process, which really is quite a new thing, in a way. > I don't understand why this patch needs to tinker with the tests in > vacuum.sql. The comments say that "If we did not do this, errors > raised would concern running ANALYZE in parallel mode." However, why > should parallel CREATE INDEX have any impact on ANALYZE at all? > Also, as a practical matter, if I revert those changes, 'make check' > still passes with or without force_parallel_mode=on. This certainly wasn't true before now -- parallel CREATE INDEX could previously cause the test to give different output for one error message. I'll revert that change. I imagine (though haven't verified) that this happened because, as you pointed out separately, I didn't get the memo about e9baa5e9 (this is the commit you mentioned in relation to README.parallel/parallel write DML). > I really dislike the fact that this patch invents another thing for > force_parallel_mode to do.
I invented force_parallel_mode mostly as a > way of testing that functions were correctly labeled for > parallel-safety, and I think it would be just fine if it never does > anything else. This is not something that I feel strongly about, though I think it is useful to test parallel CREATE INDEX in low memory conditions, one way or another. > I don't think it will be a > good thing for PostgreSQL if we end up with force_parallel_mode=on as > a general "use parallelism even though it's stupid" flag, requiring > supporting code in many different places throughout the code base and > a laundry list of not-actually-useful behavior changes in the > documentation. I will admit that "use parallelism even though it's stupid" is how I thought of force_parallel_mode=on. I thought of it as a testing option that users shouldn't need to concern themselves with in almost all cases. I am not at all attached to what I did with force_parallel_mode, except that it provides some way to test low memory conditions, and it was something that I thought you'd expect from this patch. > What I think would be a lot more useful, and what I sort of expected > the patch to have, is a way for a user to explicitly control the > number of workers requested for a CREATE INDEX operation. I tend to agree. It wouldn't be *very* compelling, because there doesn't seem to be much subtlety to how many workers get used anyway, but it's worth having. > We all know > that the cost model is crude and that may be OK -- though it would be > interesting to see some research on what the run times actually look > like for various numbers of workers at various table sizes and > work_mem settings -- but it will be inconvenient for DBAs who actually > know what number of workers they want to use to instead get whatever > value plan_create_index_workers() decides to emit. I did a lot of unpublished research on this over a year ago, and noticed nothing strange then. I guess I could revisit it on the box that Postgres Pro provided me access to. > They can force it > by setting the parallel_workers reloption, but that affects queries. > They can probably also do it by setting min_parallel_table_scan_size = > 0 and max_parallel_workers_maintenance to whatever value they want, > but I think it would be convenient for there to be a more > straightforward way to do it, or at least some documentation in the > CREATE INDEX page about how to get the number of workers you really > want. To be clear, I don't think that this is a must-fix issue for > this patch to get committed, but I do think that all references to > force_parallel_mode=on should go away. The only reason I didn't add a "just use this many parallel workers" option myself already is that doing so introduces awkward ambiguities. Long ago, there was a parallel_workers index storage param added by the patch, which you didn't like because it confused the issue in just the same way as the table parallel_workers storage param does now, would have confused parallel index scan, and so on. I counter-argued that though this was ugly, it seemed to be how it worked on other systems (more of an explanation than an argument, actually, because I find it hard to know what to do here). You're right that there should be a way to simply force the number of parallel workers for DDL commands that use parallelism.
You're also right that this shouldn't be a storage parameter (index or otherwise), because a storage parameter modifies run-time behavior in a surprising way (even if this pitfall *is* actually something that users of SQL Server and Oracle have to live with). Adding something to the CREATE INDEX grammar just for this *also* seems confusing, because users will think that it is a storage parameter even though it isn't (I'm pretty sure that almost no Postgres user can give you a definition of a storage parameter without some prompting). I share your general feelings on all of this, but I really don't know what to do about it. Which of these alternatives is the least worst, all things considered? > I do not like the way that this patch wants to turn the section of the > documentation on when parallel query can be used into a discussion of > when parallelism can be used. I think it would be better to leave > that section alone and instead document under CREATE INDEX the > concerns specific to parallel index build. I think this will be easier > for users to understand and far easier to maintain as the number of > parallel DDL operations increases, which I expect it to do somewhat > explosively. WFM. > The documentation for max_parallel_workers_maintenance cribs from the > documentation for max_parallel_workers_per_gather in saying that we'll > use fewer workers than expected "which may be inefficient". However, > for parallel CREATE INDEX, that trailing clause is, at least as far as > I can see, not applicable. Fair point. Will revise. > (Various points on phrasing and punctuation) That all seems fine. > + * To support parallel sort operations involving coordinated callers to > + * tuplesort.c routines across multiple workers, it is necessary to > + * concatenate each worker BufFile/tapeset into one single leader-wise > + * logical tapeset. Workers should have produced one final materialized > + * tape (their entire output) when this happens in leader; there will always > + * be the same number of runs as input tapes, and the same number of input > + * tapes as workers. > > I can't interpret the word "leader-wise". A partition-wise join is a > join done one partition at a time, but a leader-wise logical tape set > is not done one leader at a time. If there's another meaning to the > affix -wise, I'm not familiar with it. Don't we just mean "a single > logical tapeset managed by the leader"? Yes, we do. Will change. > There's a lot here I haven't grokked yet, but I'm running out of > mental energy so I think I'll send this for now and work on this some > more when time permits, hopefully tomorrow. The good news is that the things that you took issue with were about what I expected you to take issue with. You seem to be getting through the review of this patch very efficiently. -- Peter Geoghegan
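PS -- to make the 64KB business above concrete, the hazard and the clamp I have in mind look roughly like this (a sketch only; nparticipants stands in for whatever variable name we actually end up with):

/* In a caller like nbtsort.c, each participant gets an even share of
 * the budget (both values are in KB): */
sortmem = maintenance_work_mem / nparticipants;	/* can truncate to 0! */

/* In tuplesort_begin_common(), formalize the 64KB floor (the minimum
 * legal work_mem value), instead of documenting it in tuplesort.h: */
state->allowedMem = Max(workMem, 64) * (int64) 1024;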
On Wed, Jan 10, 2018 at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Can we actually call it max_parallel_maintenance_workers instead? >> I mean we don't have work_mem_maintenance. > > Good point. WFM. -- Peter Geoghegan
On Wed, Jan 10, 2018 at 5:05 PM, Peter Geoghegan <pg@bowt.ie> wrote: > How about I remove the comment, but have tuplesort_begin_common() > force each Tuplesortstate to have workMem that is at least 64KB > (the minimum legal work_mem value) in all cases? We can just formalize the > existing assumption that workMem cannot go below 64KB, really, and it > isn't reasonable to use so little workMem within a parallel worker (it > should be prevented by plan_create_index_workers() in the real world, > where parallelism is never artificially forced). +1. I think this doesn't even need to be documented. You can simply write a comment that says something like: /* Always allow each worker to use at least 64kB. If the amount of memory allowed for the sort is very small, this might technically cause us to exceed it, but since it's tiny compared to the overall memory cost of running a worker in the first place, it shouldn't matter. */ > I share your general feelings on all of this, but I really don't know > what to do about it. Which of these alternatives is the least worst, > all things considered? Let's get the patch committed without any explicit way of forcing the number of workers and then think about adding that later. It will be good if you and Rushabh can agree on who will produce the next version of this patch, and also if I have some idea when that version should be expected. On another point, we will need to agree on how this should be credited in an eventual commit message. I do not agree with adding Heikki as an author unless he contributed code, but we can credit him in some other way, like "Thanks are also due to Heikki Linnakangas for significant improvements to X, Y, and Z that made this patch possible." I assume the author credit will be "Peter Geoghegan, Rushabh Lathia" in that order, but let me know if anyone thinks that isn't the right idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > + * To support parallel sort operations involving coordinated callers to > + * tuplesort.c routines across multiple workers, it is necessary to > + * concatenate each worker BufFile/tapeset into one single leader-wise > + * logical tapeset. Workers should have produced one final materialized > + * tape (their entire output) when this happens in leader; there will always > + * be the same number of runs as input tapes, and the same number of input > + * tapes as workers. > > I can't interpret the word "leader-wise". A partition-wise join is a > join done one partition at a time, but a leader-wise logical tape set > is not done one leader at a time. If there's another meaning to the > affix -wise, I'm not familiar with it. Don't we just mean "a single > logical tapeset managed by the leader"? https://www.merriam-webster.com/dictionary/-wise -wise (adverb combining form): 1a: in the manner of (crabwise, fanwise); 1b: in the position or direction of (slantwise, clockwise); 2: with regard to, in respect of (dollarwise). I think "one at a time" is not the right way to interpret the affix. Rather, a "partitionwise join" is a join done "in the manner of partitions", that is, the characteristics of the partitions are considered when the join is done. I'm not defending the "leader-wise" term here, though, because I can't make sense of it, regardless of how I interpret the -wise affix. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 10, 2018 at 2:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I share your general feelings on all of this, but I really don't know >> what to do about it. Which of these alternatives is the least worst, >> all things considered? > > Let's get the patch committed without any explicit way of forcing the > number of workers and then think about adding that later. It could be argued that you need some way of forcing low memory in workers with any committed version. So while this sounds reasonable, it might not be compatible with throwing out what I've done with force_parallel_mode up-front, before you commit anything. What do you think? > It will be good if you and Rushabh can agree on who will produce the > next version of this patch, and also if I have some idea when that > version should be expected. I'll take it. > On another point, we will need to agree > on how this should be credited in an eventual commit message. I do > not agree with adding Heikki as an author unless he contributed code, > but we can credit him in some other way, like "Thanks are also due to > Heikki Linnakangas for significant improvements to X, Y, and Z that > made this patch possible." I agree that I should have been more nuanced with this. Here's what I intended: Heikki is not the author of any of the code in the final commit, but he is morally a (secondary) author of the feature as a whole, and should be credited as such within the final release notes. This is justified by the history here, which is that he was involved with the patch fairly early on, and did some work that was particularly important to the feature, that almost certainly would not otherwise have happened. Sure, it helped the serial case too, but much less so. That's really not why he did it. > I assume the author credit will be "Peter > Geoghegan, Rushabh Lathia" in that order, but let me know if anyone > thinks that isn't the right idea. "Peter Geoghegan, Rushabh Lathia" seems right. Thomas did write a very small amount of the actual code, but I think it was more of a review thing (he is already credited as a reviewer). -- Peter Geoghegan
On Wed, Jan 10, 2018 at 2:36 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I think "one at a time" is not the right way to interpret the affix. > Rather, a "partitionwise join" is a join done "in the manner of > partitions", that is, the characteristics of the partitions are > considered when the join is done. > > I'm not defending the "leader-wise" term here, though, because I can't > make sense of it, regardless of how I interpret the -wise affix. I've already conceded the point, but fwiw "leader-wise" comes from the idea of having a leader-wise space that results from concatenating the worker tapes (each of which has its own original/worker-wise space). We must apply an offset to get from a worker-wise offset to a leader-wise offset. This made more sense in an earlier version. I overlooked this during recent self-review. -- Peter Geoghegan
On Thu, Jan 11, 2018 at 11:42 AM, Peter Geoghegan <pg@bowt.ie> wrote: > "Peter Geoghegan, Rushabh Lathia" seems right. Thomas did write a very > small amount of the actual code, but I think it was more of a review > thing (he is already credited as a reviewer). +1 -- Thomas Munro http://www.enterprisedb.com
On Thu, Jan 11, 2018 at 3:35 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 10, 2018 at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Can we actually call it max_parallel_maintenance_workers instead?
>>> I mean we don't have work_mem_maintenance.
>>
>> Good point.
>
> WFM.
This is a good point. I agree with max_parallel_maintenance_workers.
--
Rushabh Lathia
On Wed, Jan 10, 2018 at 5:42 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 10, 2018 at 2:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I share your general feelings on all of this, but I really don't know >>> what to do about it. Which of these alternatives is the least worst, >>> all things considered? >> >> Let's get the patch committed without any explicit way of forcing the >> number of workers and then think about adding that later. > > It could be argued that you need some way of forcing low memory in > workers with any committed version. So while this sounds reasonable, > it might not be compatible with throwing out what I've done with > force_parallel_mode up-front, before you commit anything. What do you > think? I think the force_parallel_mode thing is too ugly to live. I'm not sure that forcing low memory in workers is a thing we need to have, but if we do, then we'll have to invent some other way to have it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 11, 2018 at 11:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think the force_parallel_mode thing is too ugly to live. I'm not > sure that forcing low memory in workers is a thing we need to have, > but if we do, then we'll have to invent some other way to have it. It might make sense to have the "minimum memory per participant" value come from a GUC, rather than be hard-coded (currently to 32MB). I don't think that it's that compelling as a user-visible option, but it might make sense as a testing option that we might very well decide to kill before v11 is released (we might kill it when we come up with an acceptable interface for "just use this many workers" in a later commit, which I think we'll definitely end up doing anyway). By setting the minimum participant memory to 0, you can then rely on the parallel_workers table storage param to force the number of worker processes that we'll request. You can accomplish the same thing with "min_parallel_table_scan_size = 0", of course. What do you think of that idea? To be clear, I'm not actually arguing that we need any of this. My point is that insisting on a way to test low memory conditions from the first commit would be reasonable. I don't actually feel strongly either way, though, and am not doing any insisting myself. -- Peter Geoghegan
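PS -- to sketch the GUC idea (min_parallel_sort_mem is a hypothetical name; its value would be in KB, the unit maintenance_work_mem uses, with the current hard-coded 32MB corresponding to 32768):

/* In plan_create_index_workers() -- scale back the worker count until
 * each participant (workers + leader) gets at least the minimum: */
while (parallel_workers > 0 &&
       maintenance_work_mem / (parallel_workers + 1) < min_parallel_sort_mem)
    parallel_workers--;

Setting the hypothetical GUC to 0 would then let the parallel_workers storage param (or min_parallel_table_scan_size = 0) force however many workers you like.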
On Thu, Jan 11, 2018 at 12:06 PM, Peter Geoghegan <pg@bowt.ie> wrote: > It might make sense to have the "minimum memory per participant" value > come from a GUC, rather than be hard coded (it's currently hard-coded > to 32MB). > What do you think of that idea? A third option here is to specifically recognize that compute_parallel_worker() returned a value based on the table storage param max_workers, and for that reason alone no "insufficient memory per participant" decrementing/vetoing should take place. That is, when the max_workers param is set, perhaps it should be completely impossible for CREATE INDEX to ignore it for any reason other than an inability to launch parallel workers (though that could be due to the max_parallel_workers GUC's setting). You could argue that we should do this anyway, I suppose. -- Peter Geoghegan
On Thu, Jan 11, 2018 at 3:25 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jan 11, 2018 at 12:06 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> It might make sense to have the "minimum memory per participant" value >> come from a GUC, rather than be hard coded (it's currently hard-coded >> to 32MB). > >> What do you think of that idea? > > A third option here is to specifically recognize that > compute_parallel_worker() returned a value based on the table storage > param max_workers, and for that reason alone no "insufficient memory > per participant" decrementing/vetoing should take place. That is, when > the max_workers param is set, perhaps it should be completely > impossible for CREATE INDEX to ignore it for any reason other than an > inability to launch parallel workers (though that could be due to the > max_parallel_workers GUC's setting). > > You could argue that we should do this anyway, I suppose. Yes, I think this sounds like a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 11, 2018 at 1:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> A third option here is to specifically recognize that >> compute_parallel_worker() returned a value based on the table storage >> param max_workers, and for that reason alone no "insufficient memory >> per participant" decrementing/vetoing should take place. That is, when >> the max_workers param is set, perhaps it should be completely >> impossible for CREATE INDEX to ignore it for any reason other than an >> inability to launch parallel workers (though that could be due to the >> max_parallel_workers GUC's setting). >> >> You could argue that we should do this anyway, I suppose. > > Yes, I think this sounds like a good idea. Cool. I've already implemented this in my local working copy of the patch. That settles that. If I'm not mistaken, the only outstanding question at this point is whether or not we're going to give in and remove parallel leader participation entirely. I suspect that we won't end up doing that, because while it's not very useful, it's also not hard to support. Besides, to some extent that's the expectation that has been established already. I am not far from posting a revision that incorporates all of your feedback. Expect that tomorrow afternoon your time at the latest. Of course, you may have more feedback for me in the meantime. Let me know if I should hold off on posting a new version. -- Peter Geoghegan
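PS -- the shape of what I just implemented locally is roughly as follows (a sketch, not the final code; it assumes a "done" label just past the memory-based scaling back):

/* In plan_create_index_workers() -- if the user set the parallel_workers
 * storage param, take it at face value, capped only by the GUC limit,
 * and skip the "insufficient memory per participant" veto entirely. */
if (rel->rel_parallel_workers != -1)
{
    parallel_workers = Min(rel->rel_parallel_workers,
                           max_parallel_workers_maintenance);
    goto done;
}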
On Wed, Jan 10, 2018 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: > There's a lot here I haven't grokked yet, but I'm running out of > mental energy so I think I'll send this for now and work on this some > more when time permits, hopefully tomorrow. Looking at the logtape changes: While the patch contains, as I said before, an excellent set of how-to directions explaining how to use the new parallel sort facilities in tuplesort.c, there seems to be no such thing for logtape.c, and as a result I find it a bit unclear how the interface is supposed to work. I think it would be good to add a similar summary here. It seems like the words "leader" and "worker" here refer to the leader of a parallel operation and the associated workers, but do we really need to make that assumption? Couldn't we generally describe this as merging a bunch of 1-tape LogicalTapeSets created from a SharedFileSet into a single LogicalTapeSet that can thereafter be read by the process that does the merging? + /* Pass worker BufFile pieces, and a placeholder leader piece */ + for (i = 0; i < lts->nTapes; i++) + { + lt = &lts->tapes[i]; + + /* + * Build concatenated view of all BufFiles, remembering the block + * number where each source file begins. + */ + if (i < lts->nTapes - 1) Unless I'm missing something, the "if" condition just causes the last pass through this loop to do nothing. If so, why not just change the loop condition to i < lts->nTapes - 1 and drop the "if" statement altogether? + char filename[MAXPGPATH] = {0}; I don't think you need = {0}, because pg_itoa is about to clobber it anyway. + /* Alter worker's tape state (generic values okay for leader) */ What do you mean by generic values? + * Each tape is initialized in write state. Serial callers pass ntapes, but + * NULL arguments for everything else. Parallel worker callers pass a + * shared handle and worker number, but tapeset should be NULL. Leader + * passes worker -1, a shared handle, and shared tape metadata. These are + * used to claim ownership of worker tapes. This comment doesn't match the actual function definition terribly well. Serial callers don't pass NULL for "everything else", because "int worker" is not going to be NULL. For parallel workers, it's not entirely obvious whether "a shared handle" means TapeShare *tapes or SharedFileSet *fileset. "tapeset" sounds like an argument name, but there is no such argument. lt->max_size looks like it might be an optimization separate from the overall patch, but maybe I'm wrong about that. + /* palloc() larger than MaxAllocSize would fail */ lt->buffer = NULL; lt->buffer_size = 0; + lt->max_size = MaxAllocSize; The comment about palloc() should move down to where you assign max_size. Generally we avoid returning a struct type, so maybe LogicalTapeFreeze() should instead grow an out parameter of type TapeShare * which it populates only if not NULL. Won't LogicalTapeFreeze() fail an assertion in BufFileExportShared() if the file doesn't belong to a shared fileset? If you adopt the previous suggestion, we can probably just make whether to call this contingent on whether the TapeShare * out parameter is provided. I'm not confident I completely understand what's going on with the logtape stuff yet, so I might have more comments (or better ones) after I study this further. To your question about whether to go ahead and post a new version, I'm OK to keep reviewing this version for a little longer or to switch to a new one, as you prefer.
I have not made any local changes, just written a blizzard of email text. :-p -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 11, 2018 at 2:26 PM, Robert Haas <robertmhaas@gmail.com> wrote: > While the patch contains, as I said before, an excellent set of how-to > directions explaining how to use the new parallel sort facilities in > tuplesort.c, there seems to be no such thing for logtape.c, and as a > result I find it a bit unclear how the interface is supposed to work. > I think it would be good to add a similar summary here. Okay. I came up with something for that. > It seems like the words "leader" and "worker" here refer to the leader > of a parallel operation and the associated workers, but do we really > need to make that assumption? Couldn't we generally describe this as > merging a bunch of 1-tape LogicalTapeSets created from a SharedFileSet > into a single LogicalTapeSet that can thereafter be read by the > process that does the merging? It's not so much an assumption as it is the most direct way of referring to these various objects. logtape.c is very clearly a submodule of tuplesort.c, so this felt okay to me. There are already several references to what tuplesort.c expects. I'm not going to argue about it if you insist on this, though I do think that trying to describe things in more general terms would be a net loss. It would kind of come off as feigning ignorance IMV. There is nothing that logtape.c could know less about other than names/roles, and I find it hard to imagine those changing, even when we add support for partitioning/distribution sort (where logtape.c handles "redistribution", something discussed early in this project's lifetime). > + /* Pass worker BufFile pieces, and a placeholder leader piece */ > + for (i = 0; i < lts->nTapes; i++) > + { > + lt = &lts->tapes[i]; > + > + /* > + * Build concatenated view of all BufFiles, remembering the block > + * number where each source file begins. > + */ > + if (i < lts->nTapes - 1) > > Unless I'm missing something, the "if" condition just causes the last > pass through this loop to do nothing. If so, why not just change the > loop condition to i < lts->nTapes - 1 and drop the "if" statement > altogether? The last "lt" in the loop is in fact used separately, just outside the loop. But that use turns out to have been subtly wrong, apparently due to a problem with converting logtape.c to use the shared buffile stuff. This buglet would only have caused writing to the leader tape to break (never trace_sort instrumentation), something that isn't supported anyway due to the restrictions that shared BufFiles have. But, we should, on general principle, be able to write to the leader tape if and when shared buffiles learn to support writing (after exporting the original BufFile in the worker). Buglet fixed in my local working copy. I did so in a way that changes the loop test along the lines you suggest. This should make the whole design of tape concatenation a bit clearer. > + char filename[MAXPGPATH] = {0}; > > I don't think you need = {0}, because pg_itoa is about to clobber it anyway. Okay. > + /* Alter worker's tape state (generic values okay for leader) */ > > What do you mean by generic values? I mean that the leader's tape doesn't need to have lt->firstBlockNumber set, because it's empty -- it can remain -1. Same applies to lt->offsetBlockNumber, too. I'll remove the text within parentheses, since it seems redundant given the structure of the loop. > + * Each tape is initialized in write state. Serial callers pass ntapes, but
Parallel worker callers pass a > + * shared handle and worker number, but tapeset should be NULL. Leader > + * passes worker -1, a shared handle, and shared tape metadata. These are > + * used to claim ownership of worker tapes. > > This comment doesn't match the actual function definition terribly > well. Serial callers don't pass NULL for "everything else", because > "int worker" is not going to be NULL. For parallel workers, it's not > entirely obvious whether "a shared handle" means TapeShare *tapes or > SharedFileSet *fileset. "tapeset" sounds like an argument name, but > there is no such argument. Okay. I've tweaked things here. > lt->max_size looks like it might be an optimization separate from the > overall patch, but maybe I'm wrong about that. I think that it's pretty much essential. Currently, the MaxAllocSize restriction is needed in logtape.c for the same reason that it's needed anywhere else. Not much to talk about there. The new max_size thing is about more than that, though -- it's really about not stupidly allocating up to a full MaxAllocSize when you already know that you're going to use next to no memory. You don't have this issue with serial sorts because serial sorts that only sort a tiny number of tuples never end up as external sorts -- when you end up doing a serial external sort, clearly you're never going to allocate an excessive amount of memory up front in logtape.c, because you are by definition operating in a memory-constrained fashion. Not so for parallel external tuplesorts. Think spool2 in a parallel unique index build, in the case where there are next to no recently dead tuples (the common case). > + /* palloc() larger than MaxAllocSize would fail */ > lt->buffer = NULL; > lt->buffer_size = 0; > + lt->max_size = MaxAllocSize; > > The comment about palloc() should move down to where you assign max_size. Okay. > Generally we avoid returning a struct type, so maybe > LogicalTapeFreeze() should instead grow an out parameter of type > TapeShare * which it populates only if not NULL. Okay. I've modified LogicalTapeFreeze(), adding a "share" output argument and reverting to returning void, as before. > Won't LogicalTapeFreeze() fail an assertion in BufFileExportShared() > if the file doesn't belong to a shared fileset? If you adopt the > previous suggestion, we can probably just make whether to call this > contingent on whether the TapeShare * out parameter is provided. Oops, you're right. It will be taken care of by the LogicalTapeFreeze() signature change you suggested. > I'm not confident I completely understand what's going on with the > logtape stuff yet, so I might have more comments (or better ones) > after I study this further. To your question about whether to go > ahead and post a new version, I'm OK to keep reviewing this version > for a little longer or to switch to a new one, as you prefer. I have > not made any local changes, just written a blizzard of email text. > :-p Great. Thanks. I've caught up with you again. I just need to take a fresh look at what I came up with, and maybe do some more testing. -- Peter Geoghegan
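PS -- with your suggestion, LogicalTapeFreeze() comes out looking roughly like this at the interface level (a sketch of my working copy; exact field names could still change):

void
LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
{
	LogicalTape *lt = &lts->tapes[tapenum];

	/* ... existing freeze logic, unchanged ... */

	/*
	 * Only report the tape's metadata when the caller asks for it, i.e.
	 * in a parallel worker.  Serial callers pass NULL, so the
	 * BufFileExportShared() assertion about fileset-backed files can
	 * never trip for them.
	 */
	if (share != NULL)
	{
		BufFileExportShared(lts->pfile);
		share->firstblocknumber = lt->firstBlockNumber;
	}
}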
On Sat, Jan 6, 2018 at 3:47 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote: >> I agree that plan_create_index_workers() needs to count the leader as a >> normal worker for the CREATE INDEX. So what you are proposing is - when >> parallel_leader_participation is true launch (return value of >> compute_parallel_worker() - 1) >> workers. true ? > > Almost. We need to not subtract one when only one worker is indicated > by compute_parallel_worker(). I also added some new stuff there, to > consider edge cases with the parallel_leader_participation GUC. > >>> I'm working on fixing up what you posted. I'm probably not more than a >>> week away from posting a patch that I'm going to mark "ready for >>> committer". I've already made the change above, and once I spend time >>> on trying to break the few small changes needed within buffile.c I'll >>> have taken it as far as I can, most likely. >>> >> >> Okay, once you submit the patch with changes - I will do one round of >> review for the changes. > > I've attached my revision. Changes include: > A few observations from skimming through the patch: 1. + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) { - snapshot = RegisterSnapshot(GetTransactionSnapshot()); - OldestXmin = InvalidTransactionId; /* not used */ + OldestXmin = GetOldestXmin(heapRelation, true); I think leader and workers should have the same idea of oldestXmin for the purpose of deciding the visibility of tuples. I think this is ensured in all forms of parallel query, as we do share the snapshot; however, the same doesn't seem to be true for parallel index builds. 2. + + /* Wait on worker processes to finish (should be almost instant) */ + reltuples = _bt_leader_wait_for_workers(buildstate); Can't we use WaitForParallelWorkersToFinish for this purpose? The reason is that if we use a different mechanism here then we might need a different way to solve the problem related to fork failure. See thread [1]. Basically, if the postmaster fails to launch workers due to fork failure, the leader backend might wait indefinitely. [1] - https://commitfest.postgresql.org/16/1341/ -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > 1. > + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) > { > - snapshot = RegisterSnapshot(GetTransactionSnapshot()); > - OldestXmin = InvalidTransactionId; /* not used */ > + OldestXmin = GetOldestXmin(heapRelation, true); > > I think leader and workers should have the same idea of oldestXmin for > the purpose of deciding the visibility of tuples. I think this is > ensured in all forms of parallel query, as we do share the snapshot; > however, the same doesn't seem to be true for parallel index builds. Hmm. Does it break anything if they use different snapshots? In the case of a query, that would be disastrous, because then you might get inconsistent results; but if the snapshot is only being used to determine what is and is not dead, then I'm not sure it makes much difference ... unless the different snapshots will create confusion of some other sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 11, 2018 at 8:58 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I've caught up with you again. I just need to take a look at what I > came up with with fresh eyes, and maybe do some more testing. More comments: BufFileView() looks fairly pointless. It basically creates a copy of the input and, in so doing, destroys the input, which is a lot like returning the input parameter except that it uses more cycles. It does do a few things. First, it zeroes the offsets array instead of copying the offsets. But as used, those offsets would have been 0 anyway. Second, it sets the fileset parameter to NULL. But that doesn't actually seem to be important for anything: the fileset is only used when creating new files, and the BufFile must already be marked read-only, so we won't be doing that. It seems like this function can just be entirely removed and replaced by Assert()-ing some things about the target in BufFileViewAppend, which I would just rename to BufFileAppend. In miscadmin.h, I'd put the prototype for the new GUC next to max_worker_processes, not maintenance_work_mem. The ereport() in index_build will, I think, confuse people when it says that there are 0 parallel workers. I suggest splitting this into two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(... "building index \"%s\" on table \"%s\" serially" ...) else ereport(... "building index \"%s\" on table \"%s\" in parallel with request for %d parallel workers" ...). Might even need three cases to handle parallel_leader_participation without needing to assemble the message, unless we drop parallel_leader_participation support. The logic in IndexBuildHeapRangeScan() around need_register_snapshot and OldestXmin seems convoluted and not very well-edited to me. For example, need_register_snapshot is set to false in a block that is only entered when it's already false, and the comment that follows is supposed to be associated with GetOldestXmin() and makes no sense here. I suggest that you go back to the original code organization and then just insert an additional case for a caller-supplied scan, so that the overall flow looks like this: if (scan != NULL) { ... } else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent) { ... } else { ... } Along with that, I'd change the name of need_register_snapshot to need_unregister_snapshot (it's doing both jobs right now) and initialize it to false. If you enter the second of the above blocks then change it to true just after snapshot = RegisterSnapshot(GetTransactionSnapshot()). Then adjust the comment that begins "Prepare for scan of the base relation." by inserting an additional sentence just after that one: "If the caller has supplied a scan, just use it. Otherwise, in a normal index build..." and the rest as it is currently. + * This support code isn't reliable when called from within a parallel + * worker process due to the fact that our state isn't propagated. This is + * why parallel index builds are disallowed on catalogs. It is possible + * that we'll fail to catch an attempted use of a user index undergoing + * reindexing due the non-propagation of this state to workers, which is not + * ideal, but the problem is not particularly likely to go undetected due to + * our not doing better there. I understand the first two sentences, but I have no idea what the third one means, especially the part that says "not particularly likely to go undetected due to our not doing better there". 
It sounds scary that something bad is only "not particularly likely to go undetected"; don't we need to detect bad things reliably? But also, you used the word "not" three times and also the prefix "un-", meaning "not", once. Four negations in 13 words! Perhaps I'm not entirely in a position to cast aspersions on overly-complex phraseology -- the pot calling the kettle black and all that -- but I bet that will be a lot clearer if you reduce the number of negations to either 0 or 1. The comment change in standard_planner() doesn't look helpful to me; I'd leave it out. + * tableOid is the table that index is to be built on. indexOid is the OID + * of a index to be created or reindexed (which must be a btree index). I'd rewrite that first sentence to end "the table on which the index is to be built". The second sentence should say "an index" rather than "a index". + * leaderWorker indicates whether leader will participate as worker or not. + * This needs to be taken into account because leader process is guaranteed to + * be idle when not participating as a worker, in contrast with conventional + * parallel relation scans, where the leader process typically has plenty of + * other work to do relating to coordinating the scan, etc. For CREATE INDEX, + * leader is usually treated as just another participant for our scaling + * calculation. OK, I get the first sentence. But the rest of this appears to be partially irrelevant and partially incorrect. The degree to which the leader is likely to be otherwise occupied isn't very relevant; as long as we think it's going to do anything at all, we have to account for it somehow. Also, the issue isn't that in a query the leader would be busy "coordinating the scan, etc." but rather that it would have to read the tuples produced by the Gather (Merge) node. I think you could just delete everything from "This needs to be..." through the end. You can cover the details of how it's used closer to the point where you do anything with leaderWorker (or, as I assume it will soon be, leaderParticipates). But, actually, I think we would be better off just ripping leaderWorker/leaderParticipates out of this function altogether. compute_parallel_worker() is not really under any illusion that it's computing a number of *participants*; it's just computing a number of *workers*. Deducting 1 when the leader is also participating but only when at least 2 workers were computed leads to an oddity: for a regular parallel sequential scan, the number of workers increases by 1 when the table size increases by a factor of 3, but here, the number of workers increases from 1 to 2 when the table size increases by a factor of 9, and then by 1 for every further multiple of 3. There doesn't seem to be any theoretical or practical justification for such behavior, or for being inconsistent with what parallel sequential scan does otherwise. I think it's fine for parallel_leader_participation=off to simply mean that you get one fewer participant. That's actually what would happen with parallel query, too. Parallel query would consider parallel_leader_participation later, in get_parallel_divisor(), when working out the cost of one path vs. another, but it doesn't use it to choose the number of workers. So it seems to me that getting rid of all of the leaderWorker considerations will make it both simpler and more consistent with what we do for queries.
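(To put numbers on the oddity, assuming the default min_parallel_table_scan_size of 8MB: compute_parallel_worker() computes 1 worker at 8MB, 2 at 24MB, 3 at 72MB, and 4 at 216MB. After deducting one for the participating leader, CREATE INDEX would request 1 worker at 8MB, still just 1 at 24MB, and 2 only at 72MB -- hence the factor-of-9 jump, versus the uniform factor-of-3 steps everywhere else.)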
To be clear, I don't think there's any real need for the cost model we choose for CREATE INDEX to be the same as the one we use for regular scans. The problem with regular scans is that it's very hard to predict how many workers we can usefully use; it depends not only on the table size but on what plan nodes get stacked on top of it higher in the plan tree. In a perfect world we'd like to add as many workers as required to avoid having the query be I/O bound and then stop, but that would require both the ability to predict future system utilization and a heck of a lot more knowledge than the planner can hope to have at this point. If you have an idea how to make a better cost model than this for CREATE INDEX, I'm willing to consider other options. If you don't, or want to propose that as a follow-up patch, then I think it's OK to use what you've got here for starters. I just don't want it to be more baroque than necessary. I think that the naming of the wait events could be improved. Right now, they are named by which kind of process does the waiting, but they really should be named based on the thing for which we're waiting. I also suggest that we could just write Sort instead of Tuplesort. In short, I suggest ParallelTuplesortLeader -> ParallelSortWorkersDone and ParallelTuplesortWorker -> ParallelSortTapeHandover. Not for this patch, but I wonder if it might be a worthwhile future optimization to allow workers to return multiple tapes to the leader. One doesn't want to go crazy with this, of course. If the worker returns 100 tapes, then the leader might get stuck doing multiple merge passes, which would be a foolish way to divide up the labor, and even if that doesn't happen, Amdahl's law argues for minimizing the amount of work that is not done in parallel. Still, what if a worker (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth merging them so that the leader can do a 5-way merge instead of a 15-way merge? Maybe this case is rare in practice, because multiple merge passes will be uncommon with reasonable values of work_mem, and it might be silly to go to the trouble of firing up workers if they'll only generate a few runs in total. Just a thought. + * Make sure that the temp file(s) underlying the tape set are created in + * suitable temp tablespaces. This is only really needed for serial + * sorts. This comment makes me wonder whether it is "sorta" needed for parallel sorts. - if (trace_sort) + if (trace_sort && !WORKER(state)) I have a feeling we still want to get this output even from workers, but maybe I'm missing something. + arg5 indicates serial, parallel worker, or parallel leader sort.</entry> I think it should say what values are used for each case. + /* Release worker tuplesorts within leader process as soon as possible */ IIUC, the worker tuplesorts aren't really holding onto much of anything in terms of resources. I think it might be better to phrase this as /* The sort we just did absorbed the final tapes produced by these tuplesorts, which are of no further use. */ or words to that effect. Instead of making a special case in CreateParallelContext for serializable_okay, maybe index_build should just use SetConfigOption() to force the isolation level to READ COMMITTED right after it does NewGUCNestLevel(). The change would only be temporary because the subsequent call to AtEOXact_GUC() will revert it.
The point isn't really that CREATE INDEX is somehow exempt from the problem that SIREAD locks haven't been updated to work correctly with parallelism; it's that CREATE INDEX itself is defined to ignore serializability concerns. There is *still* more to review here, but my concentration is fading. If you could post an updated patch after adjusting for the comments above, I think that would be helpful. I'm not totally out of things to review that I haven't already looked over once, but I think I'm close. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
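P.S. To be concrete about the IndexBuildHeapRangeScan() restructuring I'm suggesting above, I mean something shaped like this (just a sketch; need_unregister_snapshot is the only new name, and the caller-supplied-scan branch is schematic):

	Snapshot	snapshot;
	bool		need_unregister_snapshot = false;

	if (scan != NULL)
	{
		/*
		 * Caller supplied a scan: just use it, and whatever snapshot it
		 * was set up with.  (How OldestXmin is obtained in this case is
		 * a separate question.)
		 */
		snapshot = scan->rs_snapshot;
	}
	else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
	{
		snapshot = RegisterSnapshot(GetTransactionSnapshot());
		need_unregister_snapshot = true;
		OldestXmin = InvalidTransactionId;	/* not used */
	}
	else
	{
		snapshot = SnapshotAny;
		/* okay to ignore lazy VACUUMs here */
		OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
	}

	/* ... scan the heap and build the index ... */

	if (need_unregister_snapshot)
		UnregisterSnapshot(snapshot);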
On Fri, Jan 12, 2018 at 6:14 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> 1. >> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) >> { >> - snapshot = RegisterSnapshot(GetTransactionSnapshot()); >> - OldestXmin = InvalidTransactionId; /* not used */ >> + OldestXmin = GetOldestXmin(heapRelation, true); >> >> I think leader and workers should have the same idea of oldestXmin for >> the purpose of deciding the visibility of tuples. I think this is >> ensured in all form of parallel query as we do share the snapshot, >> however, same doesn't seem to be true for Parallel Index builds. > > Hmm. Does it break anything if they use different snapshots? In the > case of a query that would be disastrous because then you might get > inconsistent results, but if the snapshot is only being used to > determine what is and is not dead then I'm not sure it makes much > difference ... unless the different snapshots will create confusion of > some other sort. I think that this is fine. GetOldestXmin() is only used when we have a ShareLock on the heap relation, and the snapshot is SnapshotAny. We're only talking about the difference between HEAPTUPLE_DEAD and HEAPTUPLE_RECENTLY_DEAD here. Indexing a heap tuple when that wasn't strictly necessary by the time you got to it is normal. However, it's not okay that GetOldestXmin()'s second argument is true in the patch, rather than PROCARRAY_FLAGS_VACUUM. That's due to bitrot that was not caught during some previous rebase (commit af4b1a08 changed the signature). Will fix. You've given me a lot more to work through in your most recent mail, Robert. I will probably get the next revision to you on Monday. Doesn't seem like there is much point in posting what I've done so far. -- Peter Geoghegan
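PS -- concretely, that just means restoring the call to the post-af4b1a08 form:

OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);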
On Sat, Jan 13, 2018 at 1:25 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 12, 2018 at 6:14 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> 1. >>> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent) >>> { >>> - snapshot = RegisterSnapshot(GetTransactionSnapshot()); >>> - OldestXmin = InvalidTransactionId; /* not used */ >>> + OldestXmin = GetOldestXmin(heapRelation, true); >>> >>> I think leader and workers should have the same idea of oldestXmin for >>> the purpose of deciding the visibility of tuples. I think this is >>> ensured in all forms of parallel query, as we do share the snapshot; >>> however, the same doesn't seem to be true for parallel index builds. >> >> Hmm. Does it break anything if they use different snapshots? In the >> case of a query, that would be disastrous, because then you might get >> inconsistent results; but if the snapshot is only being used to >> determine what is and is not dead, then I'm not sure it makes much >> difference ... unless the different snapshots will create confusion of >> some other sort. > > I think that this is fine. GetOldestXmin() is only used when we have a > ShareLock on the heap relation, and the snapshot is SnapshotAny. We're > only talking about the difference between HEAPTUPLE_DEAD and > HEAPTUPLE_RECENTLY_DEAD here. Indexing a heap tuple when that wasn't > strictly necessary by the time you got to it is normal. > Yeah, but this would mean that now with parallel create index, it is possible that some tuples from the transaction would end up in the index and others won't. In general, this makes me slightly nervous mainly because such a case won't be possible without the parallel option for create index, but if you and Robert are okay with it as there is no fundamental problem, then we might as well leave it as it is or maybe add a comment saying so. Another point is that the information about broken hot chains indexInfo->ii_BrokenHotChain is getting lost. I think you need to coordinate this information among backends that participate in parallel create index. A test to reproduce the problem is as below: create table tbrokenchain(c1 int, c2 varchar); insert into tbrokenchain values(3, 'aaa'); begin; set force_parallel_mode=on; update tbrokenchain set c2 = 'bbb' where c1=3; create index idx_tbrokenchain on tbrokenchain(c1); commit; Now, check the value of indcheckxmin in pg_index: it should be true, but with the patch it is false. You can also try this with the patch without changing the value of force_parallel_mode. The patch uses both parallel_leader_participation and force_parallel_mode, but it seems the definition is different from what we have in Gather. Basically, even with force_parallel_mode, the leader is participating in the parallel build. I see there is some discussion above about both these parameters and still, there is not complete agreement on the best way forward. I think we should have parallel_leader_participation as that can help in testing, if nothing else. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 13, 2018 at 6:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > The patch uses both parallel_leader_participation and > force_parallel_mode, but it seems the definition is different from > what we have in Gather. Basically, even with force_parallel_mode, the > leader is participating in the parallel build. I see there is some > discussion above about both these parameters and still, there is not > complete agreement on the best way forward. I think we should have > parallel_leader_participation as that can help in testing, if nothing > else. > Or maybe just have force_parallel_mode. I think at least one of these is required to make some form of testing of the parallel code easy. As you can see from my previous email, it was quite easy to demonstrate a test with force_parallel_mode. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, but this would mean that now with parallel create index, it is > possible that some tuples from the transaction would end up in the index > and others won't. You mean some tuples from some past transaction that deleted a bunch of tuples and committed, but not before someone acquired a still-held snapshot that didn't see the deleter's transaction as committed yet? I guess that that is different, but it doesn't matter. All that matters is that in the end, the index contains all entries for all heap tuples visible to any possible snapshot (though possibly excluding some existing old snapshots iff we detect broken HOT chains during builds). > In general, this makes me slightly nervous mainly > because such a case won't be possible without the parallel option for > create index, but if you and Robert are okay with it as there is no > fundamental problem, then we might as well leave it as it is or maybe > add a comment saying so. Let me try to explain this another way, in terms of the high-level intuition that I have about it (Robert can probably skip this part). GetOldestXmin() returns a value that is inherently a *conservative* cut-off. In hot standby mode, it's possible for the value it returns to go backwards from a value previously returned within the same backend. Even with serial builds, the exact instant that GetOldestXmin() gets called could vary based on something like the OS scheduling of the process that runs CREATE INDEX. It could have a different value based only on that. It follows that it won't matter if parallel CREATE INDEX participants have a slightly different value, because the cut-off is all about the consistency of the index with what the universe of possible snapshots could see in the heap, not the consistency of different parts of the index with each other (the parts produced from heap tuples read from each participant). Look at how the pg_visibility module calls GetOldestXmin() to recheck -- it has to call GetOldestXmin() a second time, with a buffer lock held on a heap page throughout. It does this to conclusively establish that the visibility map is corrupt (otherwise, it could just be that the cut-off became stale). Putting all of this together, it would be safe for the HEAPTUPLE_RECENTLY_DEAD case within IndexBuildHeapRangeScan() to call GetOldestXmin() again (a bit like pg_visibility does), to avoid having to index an actually-fully-dead-by-now tuple (we could call HeapTupleSatisfiesVacuum() a second time for the heap tuple, hoping to get HEAPTUPLE_DEAD the second time around). This optimization wouldn't work out a lot of the time (it would only work out when an old snapshot went away during the CREATE INDEX), and would add ProcArrayLock traffic, so we don't do it. But AFAICT it's feasible. > Another point is that the information about broken hot chains > indexInfo->ii_BrokenHotChain is getting lost. I think you need to > coordinate this information among backends that participate in > parallel create index. A test to reproduce the problem is as below: > > create table tbrokenchain(c1 int, c2 varchar); > insert into tbrokenchain values(3, 'aaa'); > > begin; > set force_parallel_mode=on; > update tbrokenchain set c2 = 'bbb' where c1=3; > create index idx_tbrokenchain on tbrokenchain(c1); > commit; > > Now, check the value of indcheckxmin in pg_index: it should be true, > but with the patch it is false.
> You can also try this with the patch without changing the value of > force_parallel_mode. Ugh, you're right. That's a real howler. Will fix. Note that my stress-testing strategy has had a lot to do with verifying that a serial build has relfiles that are physically identical to parallel builds. Obviously that couldn't have caught this, because this only concerns the state of the pg_index catalog. > The patch uses both parallel_leader_participation and > force_parallel_mode, but it seems the definition is different from > what we have in Gather. Basically, even with force_parallel_mode, the > leader is participating in the parallel build. I see there is some > discussion above about both these parameters and still, there is not > complete agreement on the best way forward. I think we should have > parallel_leader_participation as that can help in testing, if nothing > else. I think that you're quite right that parallel_leader_participation needs to be supported for testing purposes. I had some sympathy for the idea that we should remove leader participation as a worker from the patch entirely, but the testing argument seems to clinch it. I'm fine with killing force_parallel_mode, though, because it will be possible to force the use of parallelism by using the existing parallel_workers table storage param in the next version of the patch, regardless of how small the table is. Thanks for the review. -- Peter Geoghegan
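PS -- my tentative plan for the ii_BrokenHotChain fix is something like the following (a sketch; it assumes a new flag alongside the mutex in the shared state struct the patch already uses for the build, here called btshared):

/* In each worker, once its portion of the heap scan is done: */
SpinLockAcquire(&btshared->mutex);
btshared->brokenhotchain |= indexInfo->ii_BrokenHotChain;
SpinLockRelease(&btshared->mutex);

/* In the leader, after waiting for workers to finish, so that
 * index_build() still sets indcheckxmin correctly: */
if (btshared->brokenhotchain)
	indexInfo->ii_BrokenHotChain = true;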
> The patch uses both parallel_leader_participation and
> force_parallel_mode, but it seems the definition is different from
> what we have in Gather. Basically, even with force_parallel_mode, the
> leader is participating in parallel build. I see there is some
> discussion above about both these parameters and still, there is not
> complete agreement on the best way forward. I think we should have
> parallel_leader_participation as that can help in testing if nothing
> else.

I think that you're quite right that parallel_leader_participation needs to be supported for testing purposes. I had some sympathy for the idea that we should remove leader participation as a worker from the patch entirely, but the testing argument seems to clinch it. I'm fine with killing force_parallel_mode, though, because it will be possible to force the use of parallelism by using the existing parallel_workers table storage param in the next version of the patch, regardless of how small the table is.

Thanks for the review.
--
Peter Geoghegan

On Sun, Jan 14, 2018 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Yeah, but this would mean that now with parallel create index, it is
>> possible that some tuples from the transaction would end up in index
>> and others won't.
>
> You mean some tuples from some past transaction that deleted a bunch
> of tuples and committed, but not before someone acquired a still-held
> snapshot that didn't see the deleter's transaction as committed yet?
>

I think I am talking about something different. Let me try to explain in some more detail. Consider a transaction T-1 that has deleted two tuples from tab-1, the first on page-1 and the second on page-2, and committed. There is a concurrent transaction T-2 with an open snapshot/query, due to which oldestXmin will be smaller than T-1. Now, in another session, we start a parallel CREATE INDEX on tab-1, which launches one worker. The worker decides to scan page-1 and will find that the deleted tuple on page-1 is Recently Dead, so it will include it in the index. In the meantime, transaction T-2 commits/aborts, which allows oldestXmin to advance past transaction T-1, and now the leader decides to scan page-2 with a freshly computed oldestXmin and finds that the tuple on that page is Dead, so it decides not to include it in the index. So, this leads to a situation where some tuples deleted by the transaction end up in the index whereas others don't. Note that I am not arguing that there is any fundamental problem with this, but just want to highlight that such a case doesn't seem to exist with serial CREATE INDEX.

>
>> The patch uses both parallel_leader_participation and
>> force_parallel_mode, but it seems the definition is different from
>> what we have in Gather. Basically, even with force_parallel_mode, the
>> leader is participating in parallel build. I see there is some
>> discussion above about both these parameters and still, there is not
>> complete agreement on the best way forward. I think we should have
>> parallel_leader_participation as that can help in testing if nothing
>> else.
>
> I think that you're quite right that parallel_leader_participation
> needs to be supported for testing purposes. I had some sympathy for
> the idea that we should remove leader participation as a worker from
> the patch entirely, but the testing argument seems to clinch it. I'm
> fine with killing force_parallel_mode, though, because it will be
> possible to force the use of parallelism by using the existing
> parallel_workers table storage param in the next version of the patch,
> regardless of how small the table is.
>

Okay, this makes sense to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Jan 14, 2018 at 8:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Jan 14, 2018 at 1:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Yeah, but this would mean that now with parallel create index, it is
>>> possible that some tuples from the transaction would end up in index
>>> and others won't.
>>
>> You mean some tuples from some past transaction that deleted a bunch
>> of tuples and committed, but not before someone acquired a still-held
>> snapshot that didn't see the deleter's transaction as committed yet?
>>
>
> I think I am talking about something different. Let me try to explain
> in some more detail. Consider a transaction T-1 that has deleted two
> tuples from tab-1, the first on page-1 and the second on page-2, and
> committed. There is a concurrent transaction T-2 with an open
> snapshot/query, due to which oldestXmin will be smaller than T-1.
> Now, in another session, we start a parallel CREATE INDEX on tab-1,
> which launches one worker. The worker decides to scan page-1 and will
> find that the deleted tuple on page-1 is Recently Dead, so it will
> include it in the index. In the meantime, transaction T-2
> commits/aborts, which allows oldestXmin to advance past transaction
> T-1, and now the leader decides to scan page-2 with a freshly computed
> oldestXmin and finds that the tuple on that page is Dead, so it
> decides not to include it in the index. So, this leads to a situation
> where some tuples deleted by the transaction end up in the index
> whereas others don't. Note that I am not arguing that there is any
> fundamental problem with this, but just want to highlight that such a
> case doesn't seem to exist with serial CREATE INDEX.

I must have not done a good job of explaining myself ("You mean some tuples from some past transaction..."), because this is exactly what I meant, and was exactly how I understood your original remarks from Saturday.

In summary, while I do agree that this is different to what we see with serial index builds, I still don't think that this is a concern for us.

--
Peter Geoghegan
On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> More comments:

Attached patch has all open issues worked through, including those that I respond to or comment on below, as well as the other feedback from your previous e-mails. Note also that I fixed the issue that Amit raised, as well as the GetOldestXmin()-argument bug that I noticed in passing when responding to Amit. I also worked on the attribution in the commit message.

Before getting to my responses to your most recent round of feedback, I want to first talk about some refactoring that I decided to do. As you can see from the master branch, tuplesort_performsort() isn't necessarily reached for spool2, even when we start out with a spool2 (that is, for many unique index builds, spool2 never even does a tuplesort_performsort()). We may instead decide to shut down spool2 when it has no (dead) tuples. I made this work just as well for the parallel case in this latest revision. I had to teach tuplesort.c to accept an early tuplesort_end() for LEADER() -- it had to be prepared to release still-waiting workers in some cases, rather than depending on nbtsort.c having called tuplesort_performsort() already. Several routines within nbtsort.c that previously knew something about parallelism now know nothing about it. This seems like a nice win.

Separately, I took advantage of the fact that within the leader, its *worker* Tuplesortstate can safely call tuplesort_end() before the leader state's tuplesort_performsort() call.

The overall effect of these two changes is that there is now a _bt_leader_heapscan() call for the parallel case that nicely mirrors the serial case's IndexBuildHeapScan() call, and once we're done with populating spools, no subsequent code needs to know a single thing about parallelism as a special case. You may notice some small changes to the tuplesort.h overview, which now advertises that callers can take advantage of this leeway.
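For reference, the master-branch pattern in btbuild() that now carries over unchanged to the parallel case looks roughly like this (paraphrased, so treat it as a sketch):

    /* btbuild(), after the heap scan has populated the spools */
    if (buildstate.spool2 && !buildstate.havedead)
    {
        /* spool2 turns out to be unnecessary -- just free it */
        _bt_spooldestroy(buildstate.spool2);  /* tuplesort_end() with no performsort */
        buildstate.spool2 = NULL;
    }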
Now on to my responses to your most recent round of feedback...

> BufFileView() looks fairly pointless. It basically creates a copy of
> the input and, in so doing, destroys the input, which is a lot like
> returning the input parameter except that it uses more cycles. It
> does do a few things.

While it certainly did occur to me that that was kind of weird, and I struggled with it on my own for a little while, I ultimately agreed with Thomas that it added something to have ltsConcatWorkerTapes() call some buffile function in every iteration of its loop. (BufFileView() + BufFileViewAppend() are code that Thomas actually wrote, though I added the asserts and comments myself.)

If you think about this in terms of the interface rather than the implementation, then it may make more sense. The encapsulation adds something which might pay off later, such as when extendBufFile() needs to work with a concatenated set of BufFiles. And even right now, I cannot simply reuse the BufFile without then losing the assert that is currently in BufFileViewAppend() (the must-not-have-an-associated-shared-fileset assert). So I'd end up asserting less (rather than more) there if BufFileView() was removed.

It wastes some cycles to not simply use the BufFile directly, but not terribly many in the grand scheme of things. This happens once per external sort operation.

> In miscadmin.h, I'd put the prototype for the new GUC next to
> max_worker_processes, not maintenance_work_mem.

But then I'd really have to put it next to max_worker_processes in globals.c, too. That would mean that it would go under "Primary determinants of sizes of shared-memory structures" within globals.c, which seems wrong to me. What do you think?

> The ereport() in index_build will, I think, confuse people when it
> says that there are 0 parallel workers. I suggest splitting this into
> two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(...
> "building index \"%s\" on table \"%s\" serially" ...) else ereport(...
> "building index \"%s\" on table \"%s\" in parallel with request for %d
> parallel workers" ...).

WFM. I've simply dropped any reference to leader participation in the messages here, to keep things simple. This seemed okay because the only thing that affects leader participation is the parallel_leader_participation GUC, which is under the user's direct control at all times, and is unlikely to be changed. Those that really want further detail have trace_sort for that.

> The logic in IndexBuildHeapRangeScan() around need_register_snapshot
> and OldestXmin seems convoluted and not very well-edited to me.

Having revisited it, I now agree that the code added to IndexBuildHeapRangeScan() was unclear, primarily in that the need_unregister_snapshot local variable was overloaded in a weird way.

> I suggest that you go back to the original code organization
> and then just insert an additional case for a caller-supplied scan, so
> that the overall flow looks like this:
>
> if (scan != NULL)
> {
>     ...
> }
> else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
> {
>     ...
> }
> else
> {
>     ...
> }

The problem that I see with this alternative flow is that the "if (scan != NULL)" and the "else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)" blocks clearly must contain code for two distinct, non-overlapping cases, despite the fact that those two cases actually do overlap somewhat. That is, a call to IndexBuildHeapRangeScan() may have a (parallel) heap scan argument (control reaches your first code block), or may not (control reaches your second or third code block). At the same time, a call to IndexBuildHeapRangeScan() may use SnapshotAny (ordinary CREATE INDEX), or may need an MVCC snapshot (either by registering its own, or using the parallel one). These two things are orthogonal.

I think I still get the gist of what you're saying, though. I've come up with a new structure that is a noticeable improvement on what I had. Importantly, the new structure let me add a number of parallelism-agnostic asserts that make sure that every ambuild routine that supports parallelism gets the details right.

> Along with that, I'd change the name of need_register_snapshot to
> need_unregister_snapshot (it's doing both jobs right now) and
> initialize it to false.

Done.

> + * This support code isn't reliable when called from within a parallel
> + * worker process due to the fact that our state isn't propagated. This is
> + * why parallel index builds are disallowed on catalogs. It is possible
> + * that we'll fail to catch an attempted use of a user index undergoing
> + * reindexing due the non-propagation of this state to workers, which is not
> + * ideal, but the problem is not particularly likely to go undetected due to
> + * our not doing better there.
>
> I understand the first two sentences, but I have no idea what the
> third one means, especially the part that says "not particularly
> likely to go undetected due to our not doing better there". It sounds
> scary that something bad is only "not particularly likely to go
> undetected"; don't we need to detect bad things reliably?
The primary point here, that you said you understood, is that we definitely need to detect when we're reindexing a catalog index within the backend, so that systable_beginscan() can do the right thing and not use the index (we also must avoid assertion failures). My solution to that problem is, of course, to not allow the use of parallel CREATE INDEX when REINDEXing a system catalog. That seems 100% fine to me.

There is a little bit of ambiguity about other cases, though -- that's the secondary point I tried to make within that comment block, and the part that you took issue with. To put this secondary point another way: it's possible that we'd fail to detect it if someone's comparator went bananas and decided it was okay to do SQL access (that resulted in an index scan of the index undergoing reindex). That does seem rather unlikely, but I felt it necessary to say something like this because ReindexIsProcessingIndex() isn't already something that only deals with catalog indexes -- it works with all indexes.

Anyway, I reworded this. I hope that what I came up with is clearer than before.

> But also,
> you used the word "not" three times and also the prefix "un-", meaning
> "not", once. Four negations in 13 words! Perhaps I'm not entirely in
> a position to cast aspersions on overly-complex phraseology -- the pot
> calling the kettle black and all that -- but I bet that will be a lot
> clearer if you reduce the number of negations to either 0 or 1.

You're not wrong. Simplified.

> The comment change in standard_planner() doesn't look helpful to me;
> I'd leave it out.

Okay.

> + * tableOid is the table that index is to be built on. indexOid is the OID
> + * of a index to be created or reindexed (which must be a btree index).
>
> I'd rewrite that first sentence to end "the table on which the index
> is to be built". The second sentence should say "an index" rather
> than "a index".

Okay.

> But, actually, I think we would be better off just ripping
> leaderWorker/leaderParticipates out of this function altogether.
> compute_parallel_worker() is not really under any illusion that it's
> computing a number of *participants*; it's just computing a number of
> *workers*.

That distinction does seem to cause plenty of confusion. While I accept what you say about compute_parallel_worker(), I still haven't gone as far as removing the leaderParticipates argument altogether, because compute_parallel_worker() isn't the only thing that matters here. (More on that below.)

> I think it's fine for
> parallel_leader_participation=off to simply mean that you get one
> fewer participants. That's actually what would happen with parallel
> query, too. Parallel query would consider
> parallel_leader_participation later, in get_parallel_divisor(), when
> working out the cost of one path vs. another, but it doesn't use it to
> choose the number of workers. So it seems to me that getting rid of
> all of the workerLeader considerations will make it both simpler and
> more consistent with what we do for queries.

I was aware of those details, and figured that parallel query fudges the compute_parallel_worker() figure's leader participation in some sense, and that that was what I needed to compensate for. After all, when parallel_leader_participation=off, having compute_parallel_worker() return 1 means rather a different thing to what it means with parallel_leader_participation=on, even though in general we seem to assume that parallel_leader_participation can only make a small difference overall.
Here's what I've done based on your feedback: I've changed the header comments, but stopped leaderParticipates from affecting the compute_parallel_worker() calculation (so, as I said, leaderParticipates stays). The leaderParticipates argument continues to affect these two aspects of plan_create_index_workers()'s return value:

1. It continues to be used so we have a total number of participants (not workers) to apply our must-have-32MB-workMem limit on participants. Parallel query has no equivalent of this, and it seems warranted. Note that this limit is no longer applied when the parallel_workers storage param was set, as discussed.

2. I continue to use the leaderParticipates argument to disallow the case where there is only one CREATE INDEX participant but parallelism is in use, because, of course, that clearly makes no sense -- we should just use a serial sort instead. (It might make sense to allow this if parallel_leader_participation was *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)

The planner can allow a single-participant parallel sequential scan path to be created without worrying about the fact that that doesn't make much sense, because a plan with only one parallel participant is always going to cost more than some serial plan (you will only get a 1-participant parallel sequential scan when force_parallel_mode is on). Obviously plan_create_index_workers() doesn't generate (partial) paths at all, so I simply have to get the same outcome (avoiding a senseless 1-participant parallel operation) some other way here.

> If you have an idea how to make a better
> cost model than this for CREATE INDEX, I'm willing to consider other
> options. If you don't, or want to propose that as a follow-up patch,
> then I think it's OK to use what you've got here for starters. I just
> don't want it to be more baroque than necessary.

I suspect that the parameters of any cost model for parallel CREATE INDEX that we're prepared to consider for v11 are: "Use a number of parallel workers that is one below the number at which the total duration of the CREATE INDEX either stays the same or goes up".

It's hard to do much better than this within those parameters. I can see a fairly noticeable benefit to parallelism with 4 parallel workers and a measly 1MB of maintenance_work_mem (when parallelism is forced) relative to the serial case with the same amount of memory. At least on my laptop, it seems to be rather hard to lose relative to a serial sort when using parallel CREATE INDEX (to be fair, I'm probably actually using way more memory than 1MB to do this due to FS cache usage). I can think of a cleverer approach to costing parallel CREATE INDEX, but it's only cleverer by weighing distributed costs. Not very relevant, for the time being.

BTW, the 32MB per participant limit within plan_create_index_workers() was chosen based on the realization that any higher value would make having a default setting of 2 for max_parallel_maintenance_workers (to match the max_parallel_workers_per_gather default) pointless when the default maintenance_work_mem value of 64MB is in use. That's not terribly scientific, though it at least doesn't come at the expense of a more scientific idea for a limit like that (I don't actually have one, you see). I am merely trying to avoid being *gratuitously* wasteful of shared resources that are difficult to accurately cost in (e.g., the distributed cost of random I/O to the system as a whole when we do a parallel index build while ridiculously low on maintenance_work_mem).
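Spelled out as a sketch (hypothetical variable names -- not the exact patch code), the limit amounts to something like this:

    /*
     * Sketch of the plan_create_index_workers() participant cap: each
     * participant should get at least 32MB of maintenance_work_mem,
     * which is measured in KB (32MB == 32768KB).
     */
    int     participants = workers + (leaderParticipates ? 1 : 0);

    if (!storage_param_was_set)
    {
        while (participants > 1 &&
               maintenance_work_mem / participants < 32768)
            participants--;
    }

    /* A sort with a single participant isn't parallel -- go serial */
    if (participants <= 1)
        return 0;

    return participants - (leaderParticipates ? 1 : 0);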
> I think that the naming of the wait events could be improved. Right
> now, they are named by which kind of process does the waiting, but
> they really should be named based on the thing for which we're
> waiting. I also suggest that we could just write Sort instead of
> Tuplesort. In short, I suggest ParallelTuplesortLeader ->
> ParallelSortWorkersDone and ParallelTuplesortLeader ->
> ParallelSortTapeHandover.

WFM. Also added documentation for the wait events to monitoring.sgml, which I somehow missed the first time around.

> Not for this patch, but I wonder if it might be a worthwhile future
> optimization to allow workers to return multiple tapes to the leader.
> One doesn't want to go crazy with this, of course. If the worker
> returns 100 tapes, then the leader might get stuck doing multiple
> merge passes, which would be a foolish way to divide up the labor, and
> even if that doesn't happen, Amdahl's law argues for minimizing the
> amount of work that is not done in parallel. Still, what if a worker
> (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth
> merging them so that the leader can do a 5-way merge instead of a
> 15-way merge?

I did think about this myself, or rather I thought specifically about building a serial/bigserial PK during pg_restore, a case that must be very common. The worker merges for such an index build will typically be *completely pointless* when all input runs are in sorted order, because the merge heap will only need to consult the root of the heap and its two immediate children throughout (commit 24598337c helped cases like this enormously). You might as well merge hundreds of runs in the leader, provided you still have enough memory per tape that you can get the full benefit of OS readahead (this is not that hard when you're only going to repeatedly read from the same tape anyway).

I'm not too worried about it, though. The overall picture is still very positive even in this case. The "extra worker merging" isn't generally a big proportion of the overall cost, especially there. More importantly, if I tried to do better, it would be the "quicksort with spillover" cost model story all over again (remember how tedious that was?). How hard are we prepared to work to ensure that we get it right when it comes to skipping worker merging, given that users always pay some overhead, even when that doesn't happen?

Note also that parallel index builds manage to unfairly *gain* advantage over serial cases (they have the good variety of dumb luck, rather than the bad variety) in certain other common cases. This happens with an *inverse* physical/logical correlation (e.g. a DESC index build on a date field). They manage to artificially do better than theory would predict, simply because a greater number of smaller quicksorts are much faster during initial run generation, without also taking a concomitant performance hit at merge time. Thomas showed this at one point. Note that even that's only true because of the qsort precheck (what I like to call the "banana skin prone" precheck, which we added to our qsort implementation in 2006) -- it would be true for *all* correlations, but that one precheck thing complicates matters.

All of this is a tricky business, and that isn't going to get any easier IMV.
> + * Make sure that the temp file(s) underlying the tape set are created in
> + * suitable temp tablespaces. This is only really needed for serial
> + * sorts.
>
> This comment makes me wonder whether it is "sorta" needed for parallel sorts.

I removed "really". The point of the comment is that we've already set up temp tablespaces for the shared fileset in the parallel case. Shared filesets figure out which tablespaces will be used up-front -- see SharedFileSetInit().

> - if (trace_sort)
> + if (trace_sort && !WORKER(state))
>
> I have a feeling we still want to get this output even from workers,
> but maybe I'm missing something.

I updated tuplesort_end() so that trace_sort reports on the end of the sort, even for worker processes. (We still don't show the generic tuplesort_begin* message for workers, though.)

> + arg5 indicates serial, parallel worker, or parallel leader sort.</entry>
>
> I think it should say what values are used for each case.

I based this on "arg0 indicates heap, index or datum sort", where it's implied that the values are respective to the order that they appear in the sentence (starting from 0). But okay, I'll do it that way all the same.

> + /* Release worker tuplesorts within leader process as soon as possible */
>
> IIUC, the worker tuplesorts aren't really holding onto much of
> anything in terms of resources. I think it might be better to phrase
> this as /* The sort we just did absorbed the final tapes produced by
> these tuplesorts, which are of no further use. */ or words to that
> effect.

Okay. Done that way.

> Instead of making a special case in CreateParallelContext for
> serializable_okay, maybe index_build should just use SetConfigOption()
> to force the isolation level to READ COMMITTED right after it does
> NewGUCNestLevel(). The change would only be temporary because the
> subsequent call to AtEOXact_GUC() will revert it.

I tried doing it that way, but it doesn't seem workable:

postgres=# begin transaction isolation level serializable ;
BEGIN
postgres=*# reindex index test_unique;
ERROR:  25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
LOCATION:  call_string_check_hook, guc.c:9953

Note that AutoVacLauncherMain() uses SetConfigOption() to set/modify default_transaction_isolation -- not transaction_isolation.

Instead, I added a bit more to the comments within CreateParallelContext(), to justify what I've done along the lines you went into. Hopefully this works better for you.

> There is *still* more to review here, but my concentration is fading.
> If you could post an updated patch after adjusting for the comments
> above, I think that would be helpful. I'm not totally out of things
> to review that I haven't already looked over once, but I think I'm
> close.

I'm impressed with how quickly you're getting through review of the patch. Hopefully we can keep that momentum up.

Thanks
--
Peter Geoghegan
Attachment
Hi all,
I have been continuing to test the parallel CREATE INDEX patch. So far
I haven't come across any issues or regressions with the patches.
Here are a few performance numbers for the latest round of testing,
which was performed on top of the 6th Jan patch submitted by Peter.
Testing was done on an OpenStack instance with:
CPU: 8
RAM: 16GB
HD: 640 GB
postgres=# select pg_size_pretty(pg_total_relation_size
('lineitem'));
pg_size_pretty
----------------
93 GB
(1 row)
-- Test 1.
max_parallel_workers_maintenance = 2
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 1GB
max_wal_size = 4GB
-- Test 2.
max_parallel_workers_maintenance = 4
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 2GB
max_wal_size = 4GB
-- Test 3.
max_parallel_workers_maintenance = 8
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 4GB
max_wal_size = 4GB
NOTE: All time-taken entries are the median of 3 consecutive runs of the same B-tree index creation query.
Time taken for parallel index creation (Tests 1/2/3 use max_parallel_workers_maintenance = 2/4/8 respectively, as configured above):

Index on "bigint" column: CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);
  Test 1: 1062446.462 ms (17:42.446) without patch, 1024972.273 ms (17:04.972) with patch -- 3.52 % improvement
  Test 2: 1053468.945 ms (17:33.469) without patch, 896375.543 ms (14:56.376) with patch -- 17.75 % improvement
  Test 3: 1082920.703 ms (18:02.921) without patch, 932550.058 ms (15:32.550) with patch -- 13.88 % improvement

Index on "integer" column: CREATE INDEX li_lineno_idx2 ON lineitem(l_linenumber);
  Test 1: 1538285.499 ms (25:38.285) without patch, 1201008.423 ms (20:01.008) with patch -- 21.92 % improvement
  Test 2: 1529837.023 ms (25:29.837) without patch, 1014188.489 ms (16:54.188) with patch -- 33.70 % improvement
  Test 3: 1642160.947 ms (27:22.161) without patch, 978518.253 ms (16:18.518) with patch -- 40.41 % improvement

Index on "numeric" column: CREATE INDEX li_qty_idx3 ON lineitem(l_quantity);
  Test 1: 3968102.568 ms (01:06:08.103) without patch, 2359304.405 ms (39:19.304) with patch -- 40.54 % improvement
  Test 2: 4129510.930 ms (01:08:49.511) without patch, 1680201.644 ms (28:00.202) with patch -- 59.31 % improvement
  Test 3: 4348248.210 ms (01:12:28.248) without patch, 1490461.879 ms (24:50.462) with patch -- 65.72 % improvement

Index on "character" column: CREATE INDEX li_lnst_idx4 ON lineitem(l_linestatus);
  Test 1: 1510273.931 ms (25:10.274) without patch, 1240265.301 ms (20:40.265) with patch -- 17.87 % improvement
  Test 2: 1516842.985 ms (25:16.843) without patch, 995730.092 ms (16:35.730) with patch -- 34.35 % improvement
  Test 3: 1580789.375 ms (26:20.789) without patch, 984975.746 ms (16:24.976) with patch -- 37.69 % improvement

Index on "date" column: CREATE INDEX li_shipdt_idx5 ON lineitem(l_shipdate);
  Test 1: 1483603.274 ms (24:43.603) without patch, 1189704.930 ms (19:49.705) with patch -- 19.80 % improvement
  Test 2: 1498348.925 ms (24:58.349) without patch, 1040421.626 ms (17:20.422) with patch -- 30.56 % improvement
  Test 3: 1653651.499 ms (27:33.651) without patch, 1016305.794 ms (16:56.306) with patch -- 38.54 % improvement

Index on "character varying" column: CREATE INDEX li_comment_idx6 ON lineitem(l_comment);
  Test 1: 6945953.838 ms (01:55:45.954) without patch, 4329696.334 ms (01:12:09.696) with patch -- 37.66 % improvement
  Test 2: 6818556.437 ms (01:53:38.556) without patch, 2834034.054 ms (47:14.034) with patch -- 58.43 % improvement
  Test 3: 6942285.711 ms (01:55:42.286) without patch, 2648430.902 ms (44:08.431) with patch -- 61.85 % improvement

Composite index on "numeric", "character" columns: CREATE INDEX li_qtylnst_idx34 ON lineitem (l_quantity, l_linestatus);
  Test 1: 4961563.400 ms (01:22:41.563) without patch, 2959722.178 ms (49:19.722) with patch -- 40.34 % improvement
  Test 2: 5242809.501 ms (01:27:22.810) without patch, 2077463.136 ms (34:37.463) with patch -- 60.37 % improvement
  Test 3: 5576765.727 ms (01:32:56.766) without patch, 1755829.420 ms (29:15.829) with patch -- 68.51 % improvement

Composite index on "date", "character varying" columns: CREATE INDEX li_shipdtcomment_idx56 ON lineitem (l_shipdate, l_comment);
  Test 1: 4693318.077 ms (01:18:13.318) without patch, 3181494.454 ms (53:01.494) with patch -- 32.21 % improvement
  Test 2: 4627624.682 ms (01:17:07.625) without patch, 2613289.211 ms (43:33.289) with patch -- 43.52 % improvement
  Test 3: 4719242.965 ms (01:18:39.243) without patch, 2685516.832 ms (44:45.517) with patch -- 43.09 % improvement
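For example, the Test 1 measurement for the first index was taken with a run of this form (session-level SETs shown for illustration; in practice the settings above were configured in postgresql.conf):

    SET maintenance_work_mem = '1GB';
    SET max_parallel_workers_maintenance = 2;
    CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);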
On Tue, Jan 16, 2018 at 6:24 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> More comments:
>
> Attached patch has all open issues worked through, including those
> that I respond to or comment on below, as well as the other feedback
> from your previous e-mails. Note also that I fixed the issue that Amit
> raised,
>

I could still reproduce it. I think the way you have fixed it has a race condition. In _bt_parallel_scan_and_sort(), the value of brokenhotchain is set after you signal the leader that the worker is done (by incrementing workersFinished). Now, the leader is free to make a decision based on the current shared state, which can give the wrong value. Similarly, I think the values of havedead and reltuples can also be wrong.
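In other words, I think the shared state needs to be fully updated before workersFinished is incremented, roughly like this (the mutex and condition variable names are illustrative, not the patch's exact code):

    /* worker, at the end of _bt_parallel_scan_and_sort() */
    SpinLockAcquire(&btshared->mutex);
    btshared->reltuples += reltuples;
    btshared->havedead |= havedead;
    btshared->brokenhotchain |= brokenhotchain;
    btshared->workersFinished++;    /* only now signal completion */
    SpinLockRelease(&btshared->mutex);
    ConditionVariableSignal(&btshared->workersdonecv);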
You neither seem to have fixed nor responded to the second problem mentioned in my email upthread [1]. To reiterate, the problem is that we can't assume that the workers we have launched will always start and finish. It is possible that the postmaster fails to start a worker due to fork failure. In such conditions, tuplesort_leader_wait will hang indefinitely, because it will wait for the workersFinished count to become equal to the number of launched workers (+1, if the leader participates), which will never happen. Am I missing something due to which this won't be a problem?

Now, I think one argument is that such a problem can happen in a parallel query, so it is not the responsibility of this patch to solve it. However, we already have a patch to solve it (there are some review comments that need to be addressed in that patch), and this patch is adding a new code path with similar symptoms that can't be fixed by the already-proposed patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BizMyxzFD6k81Deyar35YJ5qdpbRTUp9cQvo%2BniQom7Q%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

On Wed, Jan 17, 2018 at 5:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I could still reproduce it. I think the way you have fixed it has a
> race condition. In _bt_parallel_scan_and_sort(), the value of
> brokenhotchain is set after you signal the leader that the worker is
> done (by incrementing workersFinished). Now, the leader is free to
> make a decision based on the current shared state, which can give the
> wrong value. Similarly, I think the values of havedead and reltuples
> can also be wrong.
>
> You neither seem to have fixed nor responded to the second problem
> mentioned in my email upthread [1]. To reiterate, the problem is that
> we can't assume that the workers we have launched will always start
> and finish. It is possible that the postmaster fails to start a worker
> due to fork failure. In such conditions, tuplesort_leader_wait will
> hang indefinitely, because it will wait for the workersFinished count
> to become equal to the number of launched workers (+1, if the leader
> participates), which will never happen. Am I missing something due to
> which this won't be a problem?

I think that both problems (the live _bt_parallel_scan_and_sort() bug, as well as the general issue with needing to account for parallel worker fork() failure) are likely solvable by not using tuplesort_leader_wait(), and instead calling WaitForParallelWorkersToFinish(). Which you suggested already.

Separately, I will need to monitor that bugfix patch, and check its progress, to make sure that what I add is comparable to what ultimately gets committed for parallel query.

--
Peter Geoghegan
On Wed, Jan 17, 2018 at 12:27 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I think that both problems (the live _bt_parallel_scan_and_sort() bug,
> as well as the general issue with needing to account for parallel
> worker fork() failure) are likely solvable by not using
> tuplesort_leader_wait(), and instead calling
> WaitForParallelWorkersToFinish(). Which you suggested already.

I'm wondering if this shouldn't instead be handled by using the new Barrier facilities. I think it would work like this:

- leader calls BarrierInit(..., 0)
- leader calls BarrierAttach() before starting workers.
- each worker, before reading anything from the parallel scan, calls BarrierAttach(). If the phase returned is greater than 0, then the worker arrived at the barrier after all the work was done, and should exit immediately.
- each worker, after finishing sorting, calls BarrierArriveAndWait(). The leader, after sorting, also calls BarrierArriveAndWait().
- when BarrierArriveAndWait() returns in the leader, all workers that actually started (and did so quickly enough) have arrived at the barrier. The leader can now do leader_takeover_tapes, being careful to adopt only the tapes actually created, since some workers may have failed to launch or launched only after sorting was already complete.
- meanwhile, the workers again call BarrierArriveAndWait().
- after it's done taking over tapes, the leader calls BarrierDetach(), releasing the workers.
- the workers call BarrierDetach() and then exit -- or maybe they don't even really need to detach

So the barrier phase numbers would have the following meanings:

0 - sorting
1 - taking over tapes
2 - done
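Put into (untested) pseudo-code, with placeholder names for the shared state and the wait events, that flow might look like this:

    /* Leader, during setup; the Barrier lives in the DSM segment: */
    BarrierInit(&shared->build_barrier, 0);
    BarrierAttach(&shared->build_barrier);  /* leader participates too */
    LaunchParallelWorkers(pcxt);

    /* Each worker, on startup: */
    if (BarrierAttach(&shared->build_barrier) > 0)
    {
        /* attached after sorting finished; must not touch the scan */
        BarrierDetach(&shared->build_barrier);
        return;
    }
    /* ... participate in parallel scan, sort, produce a tape ... */
    BarrierArriveAndWait(&shared->build_barrier, WAIT_EVENT_PLACEHOLDER); /* ends phase 0 */
    BarrierArriveAndWait(&shared->build_barrier, WAIT_EVENT_PLACEHOLDER); /* waits out phase 1 */
    BarrierDetach(&shared->build_barrier);

    /* Leader, after doing its own share of scanning and sorting: */
    BarrierArriveAndWait(&shared->build_barrier, WAIT_EVENT_PLACEHOLDER); /* ends phase 0 */
    leader_takeover_tapes();    /* adopt only tapes actually created */
    BarrierDetach(&shared->build_barrier);  /* advances phase, releasing workers */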
This could be slightly more elegant if BarrierArriveAndWait() had an additional argument indicating the phase number for which the backend could wait, or maybe the number of phases for which it should wait. Then, the workers could avoid having to call BarrierArriveAndWait() twice in a row.

While I find the Barrier API slightly confusing -- and I suspect I'm not entirely alone -- I don't think that's a good excuse for reinventing the wheel. The problem of needing to wait for every process that does A (in this case, read tuples from the scan) to also do B (in this case, finish sorting those tuples) is a very general one that is deserving of a general solution. Unless somebody comes up with a better plan, Barrier seems to be the way to do that in PostgreSQL.

I don't think using WaitForParallelWorkersToFinish() is a good idea. That would require workers to hold onto their tuplesorts until after losing the ability to send messages to the leader, which doesn't sound like a very good plan. We don't want workers to detach from their error queues until the bitter end, lest errors go unreported.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

On Mon, Jan 15, 2018 at 7:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> BufFileView() looks fairly pointless. It basically creates a copy of
>> the input and, in so doing, destroys the input, which is a lot like
>> returning the input parameter except that it uses more cycles. It
>> does do a few things.
>
> While it certainly did occur to me that that was kind of weird, and I
> struggled with it on my own for a little while, I ultimately agreed
> with Thomas that it added something to have ltsConcatWorkerTapes()
> call some buffile function in every iteration of its loop.
> (BufFileView() + BufFileViewAppend() are code that Thomas actually
> wrote, though I added the asserts and comments myself.)

Hmm, well, if Thomas contributed code to this patch, then he needs to be listed as an author. I went searching for an email on this thread (or any other) where he posted code for this, thinking that there might be some discussion explaining the motivation, but I didn't find any.

I'm still in favor of erasing this distinction.

> If you think about this in terms of the interface rather than the
> implementation, then it may make more sense. The encapsulation adds
> something which might pay off later, such as when extendBufFile()
> needs to work with a concatenated set of BufFiles. And even right now,
> I cannot simply reuse the BufFile without then losing the assert that
> is currently in BufFileViewAppend() (must not have associated shared
> fileset assert). So I'd end up asserting less (rather than more) there
> if BufFileView() was removed.

I would see the encapsulation as having some value if the original BufFile remained valid and the new view were also valid. Then the BufFileView operation is a bit like a copy-on-write filesystem snapshot: you have the original, which you can do stuff with, and you have a copy, which can be manipulated independently, but the copying is cheap. But here the BufFile gobbles up the original, so I don't see the point.

The Assert(target->fileset == NULL) that would be lost in BufFileViewAppend has no value anyway, AFAICS. There is also Assert(source->readOnly), given which the presence or absence of the fileset makes no difference. And if, as you say, extendBufFile were eventually made to work here, this Assert would presumably get removed anyway; I think we'd likely want the additional files to get associated with the shared file set rather than being locally temporary files.

> It wastes some cycles to not simply use the BufFile directly, but not
> terribly many in the grand scheme of things. This happens once per
> external sort operation.

I'm not at all concerned about the loss of cycles. I'm concerned about making the mechanism more complicated to understand and maintain for future readers of the code. When experienced hackers see code that doesn't seem to accomplish anything, they (or at least I) tend to assume that there must be a hidden reason for it to be there and spend time trying to figure out what it is. If there actually is no hidden purpose, then that study is a waste of time and we can spare them the trouble by getting rid of it now.

>> In miscadmin.h, I'd put the prototype for the new GUC next to
>> max_worker_processes, not maintenance_work_mem.
>
> But then I'd really have to put it next to max_worker_processes in
> globals.c, too. That would mean that it would go under "Primary
> determinants of sizes of shared-memory structures" within globals.c,
> which seems wrong to me. What do you think?

OK, that's a fair point.
> I think I still get the gist of what you're saying, though. I've come up with a new structure that is a noticeable improvement on what I had. Importantly, the new structure let me add a number of parallelism-agnostic asserts that make sure that every ambuild routine that supports parallelism gets the details right.

Yes, that looks better. I'm slightly dubious that the new Asserts() are worthwhile, but I guess it's OK. But I think it would be better to ditch the if-statement and do it like this:

Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
       !TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);

Also, I think you've got a little more than you need in terms of comments. I would keep the comments for the serial case and parallel case and drop the earlier one that basically says the same thing:

+ * (Note that parallel case never has us register/unregister snapshot, and
+ * provides appropriate snapshot for us.)

> There is a little bit of ambiguity about other cases, though -- that's the secondary point I tried to make within that comment block, and the part that you took issue with. To put this secondary point another way: It's possible that we'd fail to detect it if someone's comparator went bananas and decided it was okay to do SQL access (that resulted in an index scan of the index undergoing reindex). That does seem rather unlikely, but I felt it necessary to say something like this because ReindexIsProcessingIndex() isn't already something that only deals with catalog indexes -- it works with all indexes.

I agree that it isn't particularly likely, but if somebody found it worthwhile to insert guards against those cases, maybe we should preserve them instead of abandoning them. It shouldn't be that hard to propagate those values from the leader to the workers. The main difficulty there seems to be that we're creating the parallel context in nbtsort.c, while the state that would need to be propagated is private to index.c, but there are several ways to solve that problem. It looks to me like the most robust approach would be to just make that part of what parallel.c naturally does. Patch for that attached.

> Here's what I've done based on your feedback: I've changed the header comments, but stopped leaderParticipates from affecting the compute_parallel_worker() calculation (so, as I said, leaderParticipates stays). The leaderParticipates argument continues to affect these two aspects of plan_create_index_workers()'s return value:
>
> 1. It continues to be used so we have a total number of participants (not workers) to apply our must-have-32MB-workMem limit on participants.
>
> Parallel query has no equivalent of this, and it seems warranted. Note that this limit is no longer applied when the parallel_workers storage param was set, as discussed.
>
> 2. I continue to use the leaderParticipates argument to disallow the case where there is only one CREATE INDEX participant but parallelism is in use, because, of course, that clearly makes no sense -- we should just use a serial sort instead.

That's an improvement, but see below.

> (It might make sense to allow this if parallel_leader_participation was *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)

As applied to parallel CREATE INDEX, it pretty much is just a testing GUC, which is why I was skeptical about leaving support for it in the patch.
There's no anticipated advantage to having the leader not participate -- unlike for parallel queries, where it is quite possible that setting parallel_leader_participation=off could be a win, even generally. If you just have a Gather over a parallel sequential scan, it is unlikely that parallel_leader_participation=off will help; it will most likely hurt, at least up to the point where more participants become a bad idea in general due to contention.

However, if you have a complex plan involving fairly-large operations that cannot be divided up among workers, such as a Parallel Append or a Hash Join with a big startup cost or a Sort that happens in the worker or even a parallel Index Scan that takes a long time to advance to the next page because it has to do I/O, you might leave workers idling while the leader is trying to "help". Some users may have workloads where this is the normal case. Ideally, the planner would figure out whether this is likely and tell the leader whether or not to participate, but we don't know how to figure that out yet. On the other hand, for CREATE INDEX, having the leader not participate can't really improve anything.

In other words, right now, parallel_leader_participation is not strictly a testing GUC, but if we make CREATE INDEX respect it, then we're pushing it towards being a GUC that you don't ever want to enable except for testing. I'm still not sure that's a very good idea, but if we're going to do it, then surely we should be consistent. It's true that having one worker and no parallel leader participation can never be better than just having the leader do it, but it is also true that having two leaders and no parallel leader participation can never be better than having 1 worker with leader participation. I don't see a reason to treat those cases differently.

If we're going to keep parallel_leader_participation support here, I think the last hunk in config.sgml should read more like this:

    Allows the leader process to execute the query plan under
    <literal>Gather</literal> and <literal>Gather Merge</literal> nodes
    and to participate in parallel index builds.  The default is
    <literal>on</literal>.  For queries, setting this value to
    <literal>off</literal> reduces the likelihood that workers will
    become blocked because the leader is not reading tuples fast enough,
    but requires the leader process to wait for worker processes to
    start up before the first tuples can be produced.  The degree to
    which the leader can help or hinder performance depends on the plan
    type or index build strategy, number of workers and query duration.
    For index builds, setting this value to <literal>off</literal> is
    expected to reduce performance, but may be useful for testing
    purposes.

> I suspect that the parameters of any cost model for parallel CREATE INDEX that we're prepared to consider for v11 are: "Use a number of parallel workers that is one below the number at which the total duration of the CREATE INDEX either stays the same or goes up".

That's pretty much the definition of a correct cost model; the trick is how to implement it without an oracle.

> BTW, the 32MB per participant limit within plan_create_index_workers() was chosen based on the realization that any higher value would make having a default setting of 2 for max_parallel_maintenance_workers (to match the max_parallel_workers_per_gather default) pointless when the default maintenance_work_mem value of 64MB is in use.
> That's not terribly scientific, though it at least doesn't come at the expense of a more scientific idea for a limit like that (I don't actually have one, you see). I am merely trying to avoid being *gratuitously* wasteful of shared resources that are difficult to accurately cost in (e.g., the distributed cost of random I/O to the system as a whole when we do a parallel index build while ridiculously low on maintenance_work_mem).

I see. I think it's a good start.

I wonder in general whether it's better to add memory or add workers. In other words, suppose I have a busy system where my index builds are slow. Should I try to free up some memory so that I can raise maintenance_work_mem, or should I try to free up some CPU resources so I can raise max_parallel_maintenance_workers? The answer doubtless depends on the current values that I have configured for those settings and the type of data that I'm indexing, as well as how much memory I could free up how easily and how much CPU I could free up how easily. But I wish I understood better than I do which one was more likely to help in a given situation.

I also wonder what the next steps would be to make this whole thing scale better. From the performance tests that have been performed so far, it seems like adding a modest number of workers definitely helps, but it tops out around 2-3x with 4-8 workers. I understand from your previous comments that's typical of other databases. It also seems pretty clear that more memory helps but only to a point. For instance, I just tried "create index x on pgbench_accounts (aid)" without your patch at scale factor 1000. With maintenance_work_mem = 1MB, it generated 6689 runs and took 131 seconds. With maintenance_work_mem = 64MB, it took 67 seconds. With maintenance_work_mem = 1GB, it took 60 seconds. More memory didn't help, even if the sort could be made entirely internal. This seems to be a fairly typical pattern: using enough memory can buy you a small multiple, using a bunch of workers can buy you a small multiple, but then it just doesn't get faster. Yet, in theory, it seems like if we're willing to provide essentially unlimited memory and CPU resources, we ought to be able to make this go almost arbitrarily fast.

>> I think that the naming of the wait events could be improved. Right now, they are named by which kind of process does the waiting, but they really should be named based on the thing for which we're waiting. I also suggest that we could just write Sort instead of Tuplesort. In short, I suggest ParallelTuplesortLeader -> ParallelSortWorkersDone and ParallelTuplesortWorker -> ParallelSortTapeHandover.
>
> WFM. Also added documentation for the wait events to monitoring.sgml, which I somehow missed the first time around.

But you forgot to update the preceding "morerows" line, so the formatting will be all messed up.

>> + * Make sure that the temp file(s) underlying the tape set are created in
>> + * suitable temp tablespaces.  This is only really needed for serial
>> + * sorts.
>>
>> This comment makes me wonder whether it is "sorta" needed for parallel sorts.
>
> I removed "really". The point of the comment is that we've already set up temp tablespaces for the shared fileset in the parallel case. Shared filesets figure out which tablespaces will be used up-front -- see SharedFileSetInit().

So why not say it that way? i.e. For parallel sorts, this should have been done already, but it doesn't matter if it gets done twice.
> I updated tuplesort_end() so that trace_sort reports on the end of the sort, even for worker processes. (We still don't show generic tuplesort_begin* message for workers, though.)

I don't see any reason not to make those contingent only on trace_sort. The user can puzzle apart which messages are which from the PIDs in the logfile.

>> Instead of making a special case in CreateParallelContext for serializable_okay, maybe index_build should just use SetConfigOption() to force the isolation level to READ COMMITTED right after it does NewGUCNestLevel(). The change would only be temporary because the subsequent call to AtEOXact_GUC() will revert it.
>
> I tried doing it that way, but it doesn't seem workable:
>
> postgres=# begin transaction isolation level serializable ;
> BEGIN
> postgres=*# reindex index test_unique;
> ERROR:  25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
> LOCATION:  call_string_check_hook, guc.c:9953

Bummer.

> Instead, I added a bit more to comments within CreateParallelContext(), to justify what I've done along the lines you went into. Hopefully this works better for you.

Yeah, that seems OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
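For the archives, the rejected approach amounts to something like the fragment below. This is a sketch rather than code from the patch, and the GucContext/GucSource arguments are my guesses; the ERROR in Peter's session comes from the transaction_isolation check hook, which rejects any assignment once the transaction has already executed a query:

```c
int     save_nestlevel = NewGUCNestLevel();

/*
 * Force READ COMMITTED for the duration of the index build.  This is
 * exactly what check_transaction_isolation refuses to allow once the
 * transaction has run a query, hence the "must be called before any
 * query" ERROR above.
 */
SetConfigOption("transaction_isolation", "read committed",
                PGC_USERSET, PGC_S_SESSION);

/* ... perform the (parallel) index build ... */

/* Reverts the temporary setting, as the suggestion anticipated. */
AtEOXact_GUC(true, save_nestlevel);
```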
On Wed, Jan 17, 2018 at 10:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 17, 2018 at 12:27 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I think that both problems (the live _bt_parallel_scan_and_sort() bug, as well as the general issue with needing to account for parallel worker fork() failure) are likely solvable by not using tuplesort_leader_wait(), and instead calling WaitForParallelWorkersToFinish(). Which you suggested already.
>
> I'm wondering if this shouldn't instead be handled by using the new Barrier facilities.
>
> While I find the Barrier API slightly confusing -- and I suspect I'm not entirely alone -- I don't think that's a good excuse for reinventing the wheel. The problem of needing to wait for every process that does A (in this case, read tuples from the scan) to also do B (in this case, finish sorting those tuples) is a very general one that is deserving of a general solution. Unless somebody comes up with a better plan, Barrier seems to be the way to do that in PostgreSQL.
>
> I don't think using WaitForParallelWorkersToFinish() is a good idea. That would require workers to hold onto their tuplesorts until after losing the ability to send messages to the leader, which doesn't sound like a very good plan. We don't want workers to detach from their error queues until the bitter end, lest errors go unreported.

What you say here sounds convincing to me. I actually brought up the idea of using the barrier abstraction a little over a month ago. I was discouraged by a complicated-sounding issue raised by Thomas [1]. At the time, I figured that the barrier abstraction was a nice-to-have, but not really essential. That idea doesn't hold up under scrutiny. I need to be able to use barriers.

There seems to be some yak shaving involved in getting the barrier abstraction to do exactly what is required, as Thomas went into at the time. How should that prerequisite work be structured? For example, should a patch be spun off for that part?

I may not be the most qualified person for this job, since Thomas considered two alternative approaches (to making the static barrier abstraction forget about never-launched participants) without ever settling on one of them.

[1] https://postgr.es/m/CAEepm=03YnefpCeB=Z67HtQAOEMuhKGyPCY_S1TeH=9a2Rr0LQ@mail.gmail.com
--
Peter Geoghegan
On Wed, Jan 17, 2018 at 7:00 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> There seems to be some yak shaving involved in getting the barrier abstraction to do exactly what is required, as Thomas went into at the time. How should that prerequisite work be structured? For example, should a patch be spun off for that part?
>
> I may not be the most qualified person for this job, since Thomas considered two alternative approaches (to making the static barrier abstraction forget about never-launched participants) without ever settling on one of them.

I had forgotten about the previous discussion. The sketch in my previous email supposed that we would use dynamic barriers since the whole point, after all, is to handle the fact that we don't know how many participants will really show up. Thomas's idea seems to be that the leader will initialize the barrier based on the anticipated number of participants and then tell it to forget about the participants that don't materialize. Of course, that would require that the leader somehow figure out how many participants didn't show up so that it can deduct them from the counter in the barrier. And how is it going to do that?

It's true that the leader will know the value of nworkers_launched, but as the comment in LaunchParallelWorkers() says: "The caller must be able to tolerate ending up with fewer workers than expected, so there is no need to throw an error here if registration fails. It wouldn't help much anyway, because registering the worker in no way guarantees that it will start up and initialize successfully."

So it seems to me that a much better plan than having the leader try to figure out how many workers failed to launch would be to just keep a count of how many workers did in fact launch. The count can be stored in shared memory, and each worker that comes along can increment it. Then we don't have to worry about whether we accurately detect failure to launch. We can argue about whether it's possible to detect all cases of failure to launch unerringly, but what's for sure is that if a worker increments a counter in shared memory, it launched. Now, where should this counter be located? There are of course multiple possibilities, but in my sketch it goes in some_barrier_variable->nparticipants, i.e., we just use a dynamic barrier.

So my position (at least until Thomas or Andres shows up and tells me why I'm wrong) is that you can use the Barrier API just as it is without any yak-shaving, just by following the sketch I set out before. The additional API I proposed in that sketch isn't really required, although it might be more efficient. But it doesn't really matter: if that comes along later, it will be trivial to adjust the code to take advantage of it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
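The nice property of that scheme is that it is trivially race-free. A hand-rolled version might look like the sketch below (struct and field names invented); with a dynamic barrier, BarrierAttach() does the equivalent bookkeeping in the barrier's participant count, which is the point of the sketch upthread:

```c
#include "postgres.h"
#include "port/atomics.h"

typedef struct SortShared           /* hypothetical state in the DSM segment */
{
    pg_atomic_uint32    nlaunched;  /* workers that demonstrably started */
    /* ... */
} SortShared;

/* Leader, while initializing shared memory: */
pg_atomic_init_u32(&shared->nlaunched, 0);

/* Each worker, first thing after attaching to the DSM segment: */
pg_atomic_fetch_add_u32(&shared->nlaunched, 1);

/*
 * No guessing about registration or fork() failures is needed: if the
 * counter was bumped, the worker launched.
 */
```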
On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> While it certainly did occur to me that that was kind of weird, and I struggled with it on my own for a little while, I ultimately agreed with Thomas that it added something to have ltsConcatWorkerTapes() call some buffile function in every iteration of its loop. (BufFileView() + BufFileViewAppend() are code that Thomas actually wrote, though I added the asserts and comments myself.)
>
> Hmm, well, if Thomas contributed code to this patch, then he needs to be listed as an author. I went searching for an email on this thread (or any other) where he posted code for this, thinking that there might be some discussion explaining the motivation, but I didn't find any. I'm still in favor of erasing this distinction.

I cleared this with Thomas recently, on this very thread, and got a +1 from him on not listing him as an author. Still, I have no problem crediting Thomas as an author instead of a reviewer, even though you're now asking me to remove what little code he actually authored. The distinction between secondary author and reviewer is often blurred, anyway. Whether or not Thomas is formally a co-author is ambiguous, and not something that I feel strongly about (there is no ambiguity about the fact that he made a very useful contribution, though -- he certainly did, both directly and indirectly).

I already went out of my way to ensure that Heikki receives a credit for parallel CREATE INDEX in the v11 release notes, even though I don't think that there is any formal rule requiring me to do so -- he *didn't* write even one line of code in this patch. (That was just my take on another ambiguous question about authorship.)

I suggest that we revisit this when you're just about to commit the patch. Or you can just add his name -- I like to err on the side of being inclusive.

>> If you think about this in terms of the interface rather than the implementation, then it may make more sense. The encapsulation adds something which might pay off later, such as when extendBufFile() needs to work with a concatenated set of BufFiles. And even right now, I cannot simply reuse the BufFile without then losing the assert that is currently in BufFileViewAppend() (must not have associated shared fileset assert). So I'd end up asserting less (rather than more) there if BufFileView() was removed.
>
> I would see the encapsulation as having some value if the original BufFile remained valid and the new view were also valid. Then the BufFileView operation is a bit like a copy-on-write filesystem snapshot: you have the original, which you can do stuff with, and you have a copy, which can be manipulated independently, but the copying is cheap. But here the BufFile gobbles up the original so I don't see the point.

I'll see what I can do about this.

>> I think I still get the gist of what you're saying, though. I've come up with a new structure that is a noticeable improvement on what I had. Importantly, the new structure let me add a number of parallelism-agnostic asserts that make sure that every ambuild routine that supports parallelism gets the details right.
>
> Yes, that looks better. I'm slightly dubious that the new Asserts() are worthwhile, but I guess it's OK.

Bear in mind that the asserts basically amount to a check that the AM propagated indexInfo->ii_Concurrent correctly within workers.
It's nice to be able to do this in a way that applies equally well to the serial case.

> But I think it would be better to ditch the if-statement and do it like this:
>
> Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
> Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
>        !TransactionIdIsValid(OldestXmin));
> Assert(snapshot == SnapshotAny || !anyvisible);
>
> Also, I think you've got a little more than you need in terms of comments. I would keep the comments for the serial case and parallel case and drop the earlier one that basically says the same thing:

Okay.

>> (ReindexIsProcessingIndex() issue with non-catalog tables)
>
> I agree that it isn't particularly likely, but if somebody found it worthwhile to insert guards against those cases, maybe we should preserve them instead of abandoning them. It shouldn't be that hard to propagate those values from the leader to the workers. The main difficulty there seems to be that we're creating the parallel context in nbtsort.c, while the state that would need to be propagated is private to index.c, but there are several ways to solve that problem. It looks to me like the most robust approach would be to just make that part of what parallel.c naturally does. Patch for that attached.

If you think it's worth the cycles, then I have no objection. I will point out that this means that everything that I say about ReindexIsProcessingIndex() no longer applies, because the relevant state will now be propagated. It doesn't need to be mentioned at all, and I don't even need to forbid builds on catalogs.

Should I go ahead and restore builds on catalogs, and remove those comments, on the assumption that your patch will be committed before mine? Obviously parallel index builds on catalogs don't matter. OTOH, why not? Perhaps it's like the debate around HOT that took place over 10 years ago, where Tom insisted that HOT work with catalogs on general principle.

>> (It might make sense to allow this if parallel_leader_participation was *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)
>
> As applied to parallel CREATE INDEX, it pretty much is just a testing GUC, which is why I was skeptical about leaving support for it in the patch. There's no anticipated advantage to having the leader not participate -- unlike for parallel queries, where it is quite possible that setting parallel_leader_participation=off could be a win, even generally. If you just have a Gather over a parallel sequential scan, it is unlikely that parallel_leader_participation=off will help; it will most likely hurt, at least up to the point where more participants become a bad idea in general due to contention.

It's unlikely to hurt much, since as you yourself said, compute_parallel_worker() doesn't consider the leader's participation. Actually, if we assume that compute_parallel_worker() is perfect, then surely parallel_leader_participation=off would beat parallel_leader_participation=on for CREATE INDEX -- it would allow us to use the value that compute_parallel_worker() truly intended. Which is the opposite of what you say about parallel_leader_participation=off above.

I am only trying to understand your perspective here. I don't think that parallel_leader_participation support is that important.
I think that parallel_leader_participation=off might be slightly useful as a way of discouraging parallel CREATE INDEX on smaller tables, just like it is for parallel sequential scan (though this hinges on specifically disallowing "degenerate parallel scan" cases). More often, it will make hardly any difference if parallel_leader_participation is on or off.

> In other words, right now, parallel_leader_participation is not strictly a testing GUC, but if we make CREATE INDEX respect it, then we're pushing it towards being a GUC that you don't ever want to enable except for testing. I'm still not sure that's a very good idea, but if we're going to do it, then surely we should be consistent.

I'm confused. I *don't* want it to be something that you can only use for testing. I want to not hurt whatever case there is for the parallel_leader_participation GUC being something that a DBA may tune in production. I don't see the conflict here.

> It's true that having one worker and no parallel leader participation can never be better than just having the leader do it, but it is also true that having two leaders and no parallel leader participation can never be better than having 1 worker with leader participation. I don't see a reason to treat those cases differently.

You must mean "having two workers and no parallel leader participation...".

The reason to treat those two cases differently is simple: One couldn't possibly be desirable in production, and undermines the whole idea of parallel_leader_participation being user visible by adding a sharp edge. The other is likely to be pretty harmless, especially because leader participation is generally pretty fudged, and our cost model is fairly rough. The difference here isn't what is important; avoiding doing something that we know couldn't possibly help under any circumstances is important. I think that we should do that on general principle.

As I said in a prior e-mail, even parallel query's use of parallel_leader_participation is consistent with what I propose here, practically speaking, because a partial path without leader participation will always lose to a serial sequential scan path in practice. The fact that the optimizer will create a partial path that makes a useless "degenerate parallel scan" a *theoretical* possibility is irrelevant, because the optimizer has its own way of making sure that such a plan doesn't actually get picked. It has its way, and so I must have my own.

> If we're going to keep parallel_leader_participation support here, I think the last hunk in config.sgml should read more like this:
>
> Allows the leader process to execute the query plan under <literal>Gather</literal> and <literal>Gather Merge</literal> nodes and to participate in parallel index builds. The default is <literal>on</literal>. For queries, setting this value to <literal>off</literal> reduces the likelihood that workers will become blocked because the leader is not reading tuples fast enough, but requires the leader process to wait for worker processes to start up before the first tuples can be produced. The degree to which the leader can help or hinder performance depends on the plan type or index build strategy, number of workers and query duration. For index builds, setting this value to <literal>off</literal> is expected to reduce performance, but may be useful for testing purposes.

Why is CREATE INDEX really that different in terms of the downside for production DBAs?
I think it's more accurate to say that it's not expected to improve performance. What do you think? >> I suspect that the parameters of any cost model for parallel CREATE >> INDEX that we're prepared to consider for v11 are: "Use a number of >> parallel workers that is one below the number at which the total >> duration of the CREATE INDEX either stays the same or goes up". > > That's pretty much the definition of a correct cost model; the trick > is how to implement it without an oracle. Correct on its own terms, at least. What I meant to convey here is that there is little scope to do better in v11 on distributed costs for the system as a whole, and therefore little scope to improve the cost model. >> BTW, the 32MB per participant limit within plan_create_index_workers() >> was chosen based on the realization that any higher value would make >> having a default setting of 2 for max_parallel_maintenance_workers (to >> match the max_parallel_workers_per_gather default) pointless when the >> default maintenance_work_mem value of 64MB is in use. > I see. I think it's a good start. I wonder in general whether it's > better to add memory or add workers. In other words, suppose I have a > busy system where my index builds are slow. Should I try to free up > some memory so that I can raise maintenance_work_mem, or should I try > to free up some CPU resources so I can raise > max_parallel_maintenance_workers? This is actually all about distributed costs, I think. Provided you have a reasonably sympathetic index build, like say a random numeric column index build, and the index won't be multiple gigabytes in size, then 1MB of maintenance_work_mem still seems to win with parallelism. This seems extremely "selfish", though -- that's going to incur a lot of random I/O for an operation that is typically characterized by sequential I/O. Plus, I bet you're using quite a bit more memory than 1MB, in the form of FS cache. It seems hard to lose if you don't care about distributed costs, especially if it's a matter of using 1 or 2 parallel workers versus just doing a serial build. Granted, you go into a 1MB of maintenance_work_mem case below where parallelism loses, which seems to contradict my suggestion that you practically cannot lose with parallelism. However, ISTM that you really went out of your way to find a case that lost. Of course, I'm not arguing that it's okay for parallel CREATE INDEX to be selfish -- it isn't. I'm prepared to say that you shouldn't use parallelism if you have 1MB of maintenance_work_mem, no matter how much it seems to help (and though it might sound crazy, because it is, it *can* help). I'm just surprised that you've not said a lot more about distributed costs, because that's where all the potential benefit seems to be. It happens to be an area that we have no history of modelling in any way, which makes it hard, but that's the situation we seem to be in. > I also wonder what the next steps would be to make this whole thing > scale better. From the performance tests that have been performed so > far, it seems like adding a modest number of workers definitely helps, > but it tops out around 2-3x with 4-8 workers. I understand from your > previous comments that's typical of other databases. Yes. This patch seems to have scalability that is very similar to the scalability that you get with similar features in other systems. I have not verified this through first hand experience/experiments, because I don't have access to that stuff. 
But I have found numerous reports related to more than one other system. I don't think that this is the only goal that matters, but I do think that it's an interesting data point. > It also seems > pretty clear that more memory helps but only to a point. For > instance, I just tried "create index x on pgbench_accounts (aid)" > without your patch at scale factor 1000. Again, I discourage everyone from giving too much weight to index builds like this one. This does not involve very much sorting at all, because everything is already in order, and the comparisons are cheap int4 comparisons. It may already be very I/O bound before you start to use parallelism. > With maintenance_work_mem = > 1MB, it generated 6689 runs and took 131 seconds. With > maintenance_work_mem = 64MB, it took 67 seconds. With > maintenance_work_mem = 1GB, it took 60 seconds. More memory didn't > help, even if the sort could be made entirely internal. This seems to > be a fairly typical pattern: using enough memory can buy you a small > multiple, using a bunch of workers can buy you a small multiple, but > then it just doesn't get faster. Adding memory is just as likely to hurt slightly as help slightly, especially if you're talking about CREATE INDEX, where being able to use a final on-the-fly merge is a big deal (you can hide the cost of the merging by doing it when you're very I/O bound anyway). This should be true with only modest assumptions: I assume that you're in one pass territory here, and that you have a reasonably small merge heap (with perhaps no more than 100 runs). This seems likely to be true the vast majority of the time with CREATE INDEX, assuming the system is reasonably well configured. Roughly speaking, once my assumptions are met, the exact number of runs almost doesn't matter (that's at least useful as a mental model). I basically disagree with the statement "using enough memory can buy you a small multiple", since it's only true when you started out using an unreasonably small amount of memory. Bear in mind that every time maintenance_work_mem is doubled, our capacity to do sorts in one pass quadruples. Using 1MB of maintenance_work_mem just doesn't make sense *economically*, unless, perhaps, you care about neither the duration of the CREATE INDEX statement, nor your electricity bill. You cannot extrapolate anything useful from an index build that uses only 1MB of maintenance_work_mem for all kinds of reasons. I suggest taking another look at Prabhat's results. Here are my observations about them: * For serial sorts, a person reading his results could be forgiven for thinking that increasing the amount of memory for a sort makes it go *slower*, at least by a bit. * Sometimes that doesn't happen for serial sorts, and sometimes it does happen for parallel sorts, but mostly it hurts serial sorts and helps parallel sorts, since Prabhat didn't start with an unreasonable low amount of maintenance_work_mem. * All the indexes are built against the same table. For the serial cases, among each index that was built, the longest build took about 6x more time than the shortest. For parallel builds, it's more like a 3x difference. The difference gets smaller when you eliminate cases that actually have to do almost no sorting. This "3x vs. 6x" difference matters a lot. This suggests to me that parallel CREATE INDEX has proven itself as something that can take a mostly CPU bound index build, and make it into a mostly I/O bound index build. 
It also suggests that we can make better use of memory with parallel CREATE INDEX only because workers will still need to get a reasonable amount of memory. You definitely don't want multiple passes in workers, but for the same reasons that you don't want them in serial cases. > Yet, in theory, it seems like if > we're willing to provide essentially unlimited memory and CPU > resources, we ought to be able to make this go almost arbitrarily > fast. The main reason that the scalability of CREATE INDEX has trouble getting past about 3.5x in cases we've seen doesn't involve any scalability theory: we're very much I/O bound during the merge, because we have to actually write out the index, regardless of what tuplesort does or doesn't do. I've seen over 4x improvements on systems that have sufficient temp file sequential I/O bandwidth, and reasonably sympathetic data distributions/types. >> WFM. Also added documentation for the wait events to monitoring.sgml, >> which I somehow missed the first time around. > > But you forgot to update the preceding "morerows" line, so the > formatting will be all messed up. Fixed. >> I removed "really". The point of the comment is that we've already set >> up temp tablespaces for the shared fileset in the parallel case. >> Shared filesets figure out which tablespaces will be used up-front -- >> see SharedFileSetInit(). > > So why not say it that way? i.e. For parallel sorts, this should have > been done already, but it doesn't matter if it gets done twice. Okay. > I don't see any reason not to make those contingent only on > trace_sort. The user can puzzle apart which messages are which from > the PIDs in the logfile. Okay. I have removed anything that restrains the verbosity of trace_sort for the WORKER() case. I think that you were right about it the first time, but I now think that this is going too far. I'm letting it go, though. -- Peter Geoghegan
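To put arithmetic behind the claim upthread that doubling maintenance_work_mem quadruples one-pass capacity, here is the back-of-the-envelope model (with workMem $M$, quicksorted initial runs come out at roughly $M$ bytes each, and a single merge pass can keep roughly $M/c$ runs' worth of tape buffers in memory at once, for some per-tape buffer cost $c$; the constant doesn't matter, only the shape):

$$ \text{one-pass capacity} \;\approx\; \underbrace{M}_{\text{bytes per run}} \times \underbrace{M/c}_{\text{runs mergeable at once}} \;=\; \frac{M^{2}}{c} $$

Going from $M$ to $2M$ therefore yields $(2M)^{2}/c = 4M^{2}/c$, which is why even a modest maintenance_work_mem keeps almost every CREATE INDEX sort in one-pass territory.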
On Wed, Jan 17, 2018 at 6:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I had forgotten about the previous discussion. The sketch in my > previous email supposed that we would use dynamic barriers since the > whole point, after all, is to handle the fact that we don't know how > many participants will really show up. Thomas's idea seems to be that > the leader will initialize the barrier based on the anticipated number > of participants and then tell it to forget about the participants that > don't materialize. Of course, that would require that the leader > somehow figure out how many participants didn't show up so that it can > deduct then from the counter in the barrier. And how is it going to > do that? I don't know; Thomas? > It's true that the leader will know the value of nworkers_launched, > but as the comment in LaunchParallelWorkers() says: "The caller must > be able to tolerate ending up with fewer workers than expected, so > there is no need to throw an error here if registration fails. It > wouldn't help much anyway, because registering the worker in no way > guarantees that it will start up and initialize successfully." So it > seems to me that a much better plan than having the leader try to > figure out how many workers failed to launch would be to just keep a > count of how many workers did in fact launch. > So my position (at least until Thomas or Andres shows up and tells me > why I'm wrong) is that you can use the Barrier API just as it is > without any yak-shaving, just by following the sketch I set out > before. The additional API I proposed in that sketch isn't really > required, although it might be more efficient. But it doesn't really > matter: if that comes along later, it will be trivial to adjust the > code to take advantage of it. Okay. I'll work on adopting dynamic barriers in the way you described. I just wanted to make sure that we're all on the same page about what that looks like. -- Peter Geoghegan
Hi, I'm mostly away from my computer this week -- sorry about that, but here are a couple of quick answers to questions directed at me: On Thu, Jan 18, 2018 at 4:22 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> While it certainly did occur to me that that was kind of weird, and I >>> struggled with it on my own for a little while, I ultimately agreed >>> with Thomas that it added something to have ltsConcatWorkerTapes() >>> call some buffile function in every iteration of its loop. >>> (BufFileView() + BufFileViewAppend() are code that Thomas actually >>> wrote, though I added the asserts and comments myself.) >> >> Hmm, well, if Thomas contributed code to this patch, then he needs to >> be listed as an author. I went searching for an email on this thread >> (or any other) where he posted code for this, thinking that there >> might be some discussion explaining the motivation, but I didn't find >> any. I'm still in favor of erasing this distinction. > > I cleared this with Thomas recently, on this very thread, and got a +1 > from him on not listing him as an author. Still, I have no problem > crediting Thomas as an author instead of a reviewer, even though > you're now asking me to remove what little code he actually authored. > The distinction between secondary author and reviewer is often > blurred, anyway. The confusion comes about because I gave some small code fragments to Rushabh for the BufFileView stuff off-list, when suggesting ideas for how to integrate Peter's patch with some ancestor of my SharedFileSet patch. It was just a sketch and whether or not any traces remain in the final commit, please credit me as a reviewer. I need to review more patches! /me ducks No objections from me if you hate the "view" idea or implementation and think it's better to make a destructive append-BufFile-to-BufFile operation instead. On Thu, Jan 18, 2018 at 4:28 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 17, 2018 at 6:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I had forgotten about the previous discussion. The sketch in my >> previous email supposed that we would use dynamic barriers since the >> whole point, after all, is to handle the fact that we don't know how >> many participants will really show up. Thomas's idea seems to be that >> the leader will initialize the barrier based on the anticipated number >> of participants and then tell it to forget about the participants that >> don't materialize. Of course, that would require that the leader >> somehow figure out how many participants didn't show up so that it can >> deduct then from the counter in the barrier. And how is it going to >> do that? > > I don't know; Thomas? The idea I mentioned would only work if nworkers_launched is never over-reported in a scenario that doesn't error out or crash, and never under-reported in any scenario. Otherwise static barriers may be even less useful than I thought. >> It's true that the leader will know the value of nworkers_launched, >> but as the comment in LaunchParallelWorkers() says: "The caller must >> be able to tolerate ending up with fewer workers than expected, so >> there is no need to throw an error here if registration fails. It >> wouldn't help much anyway, because registering the worker in no way >> guarantees that it will start up and initialize successfully." 
>> So it seems to me that a much better plan than having the leader try to figure out how many workers failed to launch would be to just keep a count of how many workers did in fact launch.

(If nworkers_launched can be silently over-reported, then does parallel_leader_participation = off have a bug? If no workers really launched and reached the main executor loop but nworkers_launched > 0, then no one is running the plan.)

>> So my position (at least until Thomas or Andres shows up and tells me why I'm wrong) is that you can use the Barrier API just as it is without any yak-shaving, just by following the sketch I set out before. The additional API I proposed in that sketch isn't really required, although it might be more efficient. But it doesn't really matter: if that comes along later, it will be trivial to adjust the code to take advantage of it.

Yeah, the dynamic Barrier API was intended for things like this. I was only trying to provide a simpler-to-use alternative that I thought might work for this particular case (but not executor nodes, which have another source of uncertainty about party size). It sounds like it's not actually workable though, and the dynamic API may be the only way. So the patch would have to deal with explicit phases.

> Okay. I'll work on adopting dynamic barriers in the way you described. I just wanted to make sure that we're all on the same page about what that looks like.

Looking at Robert's sketch, a few thoughts:

(1) it's not OK to attach and then just exit, you'll need to detach from the barrier both in the case where the worker exits early because the phase is too high and the case where you attach in time to help and run to completion;

(2) maybe workers could use BarrierArriveAndDetach() at the end (the leader needs to use BarrierArriveAndWait(), but the workers don't really need to wait for each other before they exit, do they?);

(3) erm, maybe it's a problem that errors occurring in workers while the leader is waiting at a barrier won't unblock the leader (we don't detach from barriers on abort/exit) -- I'll look into this.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Jan 18, 2018 at 4:19 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Hi,
>
> I'm mostly away from my computer this week -- sorry about that, but here are a couple of quick answers to questions directed at me:
>
> On Thu, Jan 18, 2018 at 4:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>> It's true that the leader will know the value of nworkers_launched, but as the comment in LaunchParallelWorkers() says: "The caller must be able to tolerate ending up with fewer workers than expected, so there is no need to throw an error here if registration fails. It wouldn't help much anyway, because registering the worker in no way guarantees that it will start up and initialize successfully." So it seems to me that a much better plan than having the leader try to figure out how many workers failed to launch would be to just keep a count of how many workers did in fact launch.
>
> (If nworkers_launched can be silently over-reported, then does parallel_leader_participation = off have a bug?

Yes, and it is being discussed in CF entry [1].

[1] - https://commitfest.postgresql.org/16/1341/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 18, 2018 at 8:52 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>> (It might make sense to allow this if parallel_leader_participation was *purely* a testing GUC, only for use by backend hackers, but AFAICT it isn't.)
>>
>> As applied to parallel CREATE INDEX, it pretty much is just a testing GUC, which is why I was skeptical about leaving support for it in the patch. There's no anticipated advantage to having the leader not participate -- unlike for parallel queries, where it is quite possible that setting parallel_leader_participation=off could be a win, even generally. If you just have a Gather over a parallel sequential scan, it is unlikely that parallel_leader_participation=off will help; it will most likely hurt, at least up to the point where more participants become a bad idea in general due to contention.
>
> It's unlikely to hurt much, since as you yourself said, compute_parallel_worker() doesn't consider the leader's participation. Actually, if we assume that compute_parallel_worker() is perfect, then surely parallel_leader_participation=off would beat parallel_leader_participation=on for CREATE INDEX -- it would allow us to use the value that compute_parallel_worker() truly intended. Which is the opposite of what you say about parallel_leader_participation=off above.
>
> I am only trying to understand your perspective here. I don't think that parallel_leader_participation support is that important. I think that parallel_leader_participation=off might be slightly useful as a way of discouraging parallel CREATE INDEX on smaller tables, just like it is for parallel sequential scan (though this hinges on specifically disallowing "degenerate parallel scan" cases). More often, it will make hardly any difference if parallel_leader_participation is on or off.
>
>> In other words, right now, parallel_leader_participation is not strictly a testing GUC, but if we make CREATE INDEX respect it, then we're pushing it towards being a GUC that you don't ever want to enable except for testing. I'm still not sure that's a very good idea, but if we're going to do it, then surely we should be consistent.

I see your point. OTOH, I think we should have something for testing purposes, as that helps in catching bugs and makes it easy to write tests that cover the worker part of the code.

> I'm confused. I *don't* want it to be something that you can only use for testing. I want to not hurt whatever case there is for the parallel_leader_participation GUC being something that a DBA may tune in production. I don't see the conflict here.
>
>> It's true that having one worker and no parallel leader participation can never be better than just having the leader do it, but it is also true that having two leaders and no parallel leader participation can never be better than having 1 worker with leader participation. I don't see a reason to treat those cases differently.
>
> You must mean "having two workers and no parallel leader participation...".
>
> The reason to treat those two cases differently is simple: One couldn't possibly be desirable in production, and undermines the whole idea of parallel_leader_participation being user visible by adding a sharp edge. The other is likely to be pretty harmless, especially because leader participation is generally pretty fudged, and our cost model is fairly rough. The difference here isn't what is important; avoiding doing something that we know couldn't possibly help under any circumstances is important. I think that we should do that on general principle.
>
> As I said in a prior e-mail, even parallel query's use of parallel_leader_participation is consistent with what I propose here, practically speaking, because a partial path without leader participation will always lose to a serial sequential scan path in practice. The fact that the optimizer will create a partial path that makes a useless "degenerate parallel scan" a *theoretical* possibility is irrelevant, because the optimizer has its own way of making sure that such a plan doesn't actually get picked. It has its way, and so I must have my own.

Can you please elaborate what part of the optimizer you are talking about, where a partial path without leader participation will always lose to a serial sequential scan path?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 17, 2018 at 10:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> As I said in a prior e-mail, even parallel query's use of parallel_leader_participation is consistent with what I propose here, practically speaking, because a partial path without leader participation will always lose to a serial sequential scan path in practice.

Amit's reply to this part drew my attention to it. I think this is entirely false. Consider an aggregate that doesn't support partial aggregation, and a plan that looks like this:

Aggregate
-> Gather
   -> Parallel Seq Scan
        Filter: something fairly selective

It is quite possible for this to be superior to a non-parallel plan even with only 1 worker and no parallel leader participation. The worker can evaluate the filter qual, and the leader can evaluate the aggregate. If the CPU costs of doing those computations are high enough to outweigh the costs of shuffling tuples between backends, we win.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 18, 2018 at 5:49 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I'm mostly away from my computer this week -- sorry about that, Yeah, seriously. Since when it is OK for hackers to ever be away from their computers? :-) > The idea I mentioned would only work if nworkers_launched is never > over-reported in a scenario that doesn't error out or crash, and never > under-reported in any scenario. Otherwise static barriers may be even > less useful than I thought. I just went back to the thread on "parallel.c oblivion of worker-startup failures" and refreshed my memory about what's going on over there. What's going on over there is (1) currently, nworkers_launched can be over-reported in a scenario that doesn't error out or crash and (2) I'm proposing to tighten things up so that this is no longer the case. Amit proposed making it the responsibility of code that uses parallel.c to cope with nworkers_launched being larger than the number that actually launched, and my counter-proposal was to make it reliably ERROR when they don't all launch. So, thinking about this, I think that my proposal to use dynamic barriers here seems like it will work regardless of who wins that argument. Your proposal to use static barriers and decrement the party size based on the number of participants which fail to start will work if I win that argument, but will not work if Amit wins that argument. It seems to me in general that dynamic barriers are to be preferred in almost every circumstance, because static barriers require a longer chain of assumptions. We can't assume that the number of guests we invite to the party will match the number that actually show up, so, in the case of a static barrier, we have to make sure to adjust the party size if some of the guests end up having to stay home with a sick kid or their car breaks down or if they decide to go to the neighbor's party instead. Absentee guests are not intrinsically a problem, but we have to make sure that we account for them in a completely water-tight fashion. On the other hand, with a dynamic barrier, we don't need to worry about the guests that don't show up; we only need to worry about the guests that DO show up. As they come in the door, we count them; as they leave, we count them again. When the numbers are equal, the party's over. That seems more robust. In particular, for parallel query, there is absolutely zero guarantee that every worker reaches every plan node. For a parallel utility command, things seem a little better: we can assume that workers are started only for one particular purpose. But even that might not be true in the future. For example, imagine a parallel CREATE INDEX on a partitioned table that cascades to all children. One can easily imagine wanting to use the same workers for the whole operation and spread them out across the pool of tasks much as Parallel Append does. There's a good chance this will be faster than doing each index build in turn with maximum parallelism. And then the static barrier thing goes right out the window again, because the number of participants is determined dynamically. I really struggle to think of any case where a static barrier is better. I mean, suppose we have an existing party and then decide to hold a baking contest. We'll use a barrier to separate the baking phase from the judging phase. One might think that, since the number of participants is already decided, someone could initialize the barrier with that number rather than making everyone attach. 
But it doesn't really work, because there's a race: while one process is creating the barrier with participants = 10, the doctor's beeper goes off and he leaves the party. Now there could be some situation in which we are absolutely certain that we know how many participants we've got and it won't change, but I suspect that in almost every scenario deciding to use a static barrier is going to be immediately followed by a lot of angst about how we can be sure that the number of participants will always be correct.

>> Okay. I'll work on adopting dynamic barriers in the way you described. I just wanted to make sure that we're all on the same page about what that looks like.
>
> Looking at Robert's sketch, a few thoughts: (1) it's not OK to attach and then just exit, you'll need to detach from the barrier both in the case where the worker exits early because the phase is too high and the case where you attach in time to help and run to completion;

In the first case, I guess this is because otherwise the other participants will wait for us even though we're not really there any more. In the second case, I'm not sure why it matters whether we detach. If we've reached the highest possible phase number, nobody's going to wait any more, so who cares? (I mean, apart from tidiness.)

> (2) maybe workers could use BarrierArriveAndDetach() at the end (the leader needs to use BarrierArriveAndWait(), but the workers don't really need to wait for each other before they exit, do they?);

They don't need to wait for each other, but they do need to wait for the leader, so I don't think this works. Logically, there are two key sequencing points. First, the leader needs to wait for the workers to finish sorting. That's the barrier between phase 0 and phase 1. Second, the workers need to wait for the leader to absorb their tapes. That's the barrier between phase 1 and phase 2. If the workers use BarrierArriveAndWait to reach phase 1 and then BarrierArriveAndDetach, they won't wait for the leader to be done adopting their tapes as they do in the current patch.

But, hang on a minute. Why do the workers need to wait for the leader anyway? Can't they just exit once they're done sorting? I think the original reason why this ended up in the patch is that we needed the leader to assume ownership of the tapes to avoid having the tapes get blown away when the worker exits. But, IIUC, with sharedfileset.c, that problem no longer exists. The tapes are jointly owned by all of the cooperating backends and the last one to detach from it will remove them. So, if the worker sorts, advertises that it's done in shared memory, and exits, then nothing should get removed and the leader can adopt the tapes whenever it gets around to it.

If that's correct, then we only need 2 phases, not 3. Workers BarrierAttach() before reading any data, exiting if the phase is not 0. Otherwise, they then read data and sort it, then advertise the final tape in shared memory, then BarrierArriveAndDetach(). The leader does BarrierAttach() before launching any workers, then reads data and sorts it if applicable, then does BarrierArriveAndWait(). When that returns, all workers are done sorting (and may or may not have finished exiting) and the leader can take over their tapes and everything is fine. That's significantly simpler than my previous outline, and also simpler than what the patch does today.
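Spelled out, the two-phase flow is short enough to sketch in full. As before, the names are invented (phase 0 = sorting, phase 1 = done), advertise_tape_in_shared_memory() is a placeholder for whatever the patch does to publish a worker's final tape, and 0 stands in for a real wait-event value:

```c
/* Leader (sketch): */
BarrierInit(&shared->barrier, 0);
BarrierAttach(&shared->barrier);
LaunchParallelWorkers(pcxt);
/* ... leader reads and sorts its own share, if participating ... */
BarrierArriveAndWait(&shared->barrier, 0);
/*
 * All sorting is done.  The shared fileset keeps worker tapes alive
 * even if the workers have since exited, so the leader can adopt
 * whichever tapes were advertised and begin the merge.
 */

/* Worker (sketch): */
if (BarrierAttach(&shared->barrier) != 0)
{
    BarrierDetach(&shared->barrier);        /* showed up too late to help */
    return;
}
/* ... read from the parallel scan, sort, produce this worker's tape ... */
advertise_tape_in_shared_memory(shared);    /* hypothetical */
BarrierArriveAndDetach(&shared->barrier);   /* no need to wait on anyone */
```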
> (3) > erm, maybe it's a problem that errors occurring in workers while the > leader is waiting at a barrier won't unblock the leader (we don't > detach from barriers on abort/exit) -- I'll look into this. I think if there's an ERROR, the general parallelism machinery is going to arrange to kill every worker, so nothing matters in that case unless barrier waits ignore interrupts, which I'm pretty sure they don't. (Also: if they do, I'll hit the ceiling; that would be awful.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 18, 2018 at 6:21 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Amit's reply to this part drew my attention to it. I think this is > entirely false. Consider an aggregate that doesn't support partial > aggregation, and a plan that looks like this: > > Aggregate > -> Gather > -> Parallel Seq Scan > Filter: something fairly selective > > It is quite possible for this to be superior to a non-parallel plan > even with only 1 worker and no parallel leader participation. The > worker can evaluate the filter qual, and the leader can evaluate the > aggregate. If the CPU costs of doing those computations are high > enough to outweigh the costs of shuffling tuples between backends, we > win. That seems pretty far fetched. But even if it wasn't, my position would not change. This could happen only because the planner determined that it was the cheapest plan when parallel_leader_participation happened to be off. But clearly a "degenerate parallel CREATE INDEX" will never be faster than a serial CREATE INDEX, and there is a simple way to always avoid one. So why not do so? I give up. I'll go ahead and make parallel_leader_participation=off allow a degenerate parallel CREATE INDEX in the next version. I think that it will make parallel_leader_participation less useful, with no upside, but there doesn't seem to be much more that I can do about that. -- Peter Geoghegan
On Wed, Jan 17, 2018 at 10:22 PM, Peter Geoghegan <pg@bowt.ie> wrote: > If you think it's worth the cycles, then I have no objection. I will > point out that this means that everything that I say about > ReindexIsProcessingIndex() no longer applies, because the relevant > state will now be propagated. It doesn't need to be mentioned at all, > and I don't even need to forbid builds on catalogs. > > Should I go ahead and restore builds on catalogs, and remove those > comments, on the assumption that your patch will be committed before > mine? Obviously parallel index builds on catalogs don't matter. OTOH, > why not? Perhaps it's like the debate around HOT that took place over > 10 years ago, where Tom insisted that HOT work with catalogs on > general principle. Yes, I think so. If you (or someone else) can review that patch, I'll go ahead and commit it, and then your patch can treat it as a solved problem. I'm not really worried about the cycles; the amount of effort required here is surely very small compared to all of the other things that have to be done when starting a parallel worker. I'm not as dogmatic as Tom is about the idea that everything must support system catalogs or it's not worth doing, but I do think it's better if it can be done that way with reasonable effort. When each new feature comes with a set of unsupported corner cases, it becomes hard for users to understand what will and will not actually work. Now, really big features like parallel query or partitioning or logical replication generally do need to exclude some things in v1 or you can never finish the project, but in this case plugging the gap seems quite feasible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 18, 2018 at 1:14 PM, Peter Geoghegan <pg@bowt.ie> wrote: > That seems pretty far fetched. I don't think it is, and there are plenty of other examples. All you need is a query plan that involves significant CPU work both below the Gather node and above the Gather node. It's not difficult to find plans like that; there are TPC-H queries that generate plans like that. > But even if it wasn't, my position > would not change. This could happen only because the planner > determined that it was the cheapest plan when > parallel_leader_participation happened to be off. But clearly a > "degenerate parallel CREATE INDEX" will never be faster than a serial > CREATE INDEX, and there is a simple way to always avoid one. So why > not do so? That's an excellent argument for making parallel CREATE INDEX ignore parallel_leader_participation entirely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 18, 2018 at 6:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I see your point. OTOH, I think we should have something for testing > purpose as that helps in catching the bugs and makes it easy to write > tests that cover worker part of the code. This is about the question of whether or not we want to allow parallel_leader_participation to prevent or allow a parallel CREATE INDEX that has 1 parallel worker that does all the sorting, with the leader simply consuming its output without doing any merging (a "degenerate parallel CREATE INDEX"). It is perhaps only secondarily about the question of ripping out parallel_leader_participation entirely. > Can you please elaborate what part of optimizer are you talking about > where without leader participation partial path will always lose to a > serial sequential scan path? See my remarks to Robert just now. -- Peter Geoghegan
On Thu, Jan 18, 2018 at 10:27 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 18, 2018 at 1:14 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> That seems pretty far fetched. > > I don't think it is, and there are plenty of other examples. All you > need is a query plan that involves significant CPU work both below the > Gather node and above the Gather node. It's not difficult to find > plans like that; there are TPC-H queries that generate plans like > that. You need a very selective qual in the worker that eliminates most input (keeping the worker busy), and yet the plan must still manage to keep the leader busy rather than waiting on input from the Gather. >> But even if it wasn't, my position >> would not change. This could happen only because the planner >> determined that it was the cheapest plan when >> parallel_leader_participation happened to be off. But clearly a >> "degenerate parallel CREATE INDEX" will never be faster than a serial >> CREATE INDEX, and there is a simple way to always avoid one. So why >> not do so? > > That's an excellent argument for making parallel CREATE INDEX ignore > parallel_leader_participation entirely. I'm done making arguments about parallel_leader_participation. Tell me what you want, and I'll do it. -- Peter Geoghegan
On Thu, Jan 18, 2018 at 10:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Should I go ahead and restore builds on catalogs, and remove those >> comments, on the assumption that your patch will be committed before >> mine? Obviously parallel index builds on catalogs don't matter. OTOH, >> why not? Perhaps it's like the debate around HOT that took place over >> 10 years ago, where Tom insisted that HOT work with catalogs on >> general principle. > > Yes, I think so. If you (or someone else) can review that patch, I'll > go ahead and commit it, and then your patch can treat it as a solved > problem. I'm not really worried about the cycles; the amount of > effort required here is surely very small compared to all of the other > things that have to be done when starting a parallel worker. Review of your patch: * SerializedReindexState could use some comments. At least a one liner stating its basic purpose. * The "System index reindexing support" comment block could do with a passing acknowledgement of the fact that this is serialized for parallel workers. * Maybe the "Serialize reindex state" comment within InitializeParallelDSM() should instead say something like "Serialize indexes-pending-reindex state". Other than that, looks good to me. It's a simple patch with a clear purpose. -- Peter Geoghegan
On Thu, Jan 18, 2018 at 9:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I just went back to the thread on "parallel.c oblivion of > worker-startup failures" and refreshed my memory about what's going on > over there. What's going on over there is (1) currently, > nworkers_launched can be over-reported in a scenario that doesn't > error out or crash and (2) I'm proposing to tighten things up so that > this is no longer the case. I think that we need to be able to rely on nworkers_launched to not over-report the number of workers launched. To be fair to Amit, I haven't actually gone off and studied the problem myself, so it's not fair to dismiss his point of view. It nevertheless seems to me that it makes life an awful lot easier to be able to rely on nworkers_launched. > So, thinking about this, I think that my proposal to use dynamic > barriers here seems like it will work regardless of who wins that > argument. Your proposal to use static barriers and decrement the > party size based on the number of participants which fail to start > will work if I win that argument, but will not work if Amit wins that > argument. Sorry, but I've changed my mind. I don't think barriers owned by tuplesort.c will work for us (though I think we will still need a synchronization primitive within nbtsort.c). The design that Robert sketched for using barriers seemed fine at first. But then I realized: what about the case where you have 2 spools? I now understand why Thomas thought that I'd end up using static barriers, because I now see that dynamic barriers have problems of their own if used by tuplesort.c, even with the trick of only having participants actually participate on the condition that they show up before the party is over (before there are no tuples left for the worker to consume). The idea of the leader using nworkers_launched as the assumed-launched number of workers is pretty much baked into my patch, because my patch makes tuplesorts composable (e.g. nbtsort.c uses two tuplesorts when there is a unique index build/2 spools). Do individual workers need to be prepared to back out of the main spool's sort, but not the spool2 sort (for unique index builds), or vice-versa? Clearly that's untenable, because they're going to need to have both as long as they're participating in a parallel CREATE INDEX (of a unique index) -- IndexBuildHeapScan() expects both at the same time, but there is a race condition when launching workers with 2 spools. So does nbtsort.c need to own the barrier instead? If it does, and if that barrier subsumes the responsibilities of tuplesort.c's condition variables, then I don't see how that can avoid causing a mess due to confusion about phases across tuplesorts/spools. nbtsort.c *will* need some synchronization primitive, actually (I'm thinking of a condition variable), but only because of the fact that nbtsort.c happens to want to aggregate statistics about the sort at the end (for pg_index) -- this doesn't seem like tuplesort's problem at all. In general, it's very natural to just call tuplesort_leader_wait(), and have all the relevant details encapsulated within tuplesort.c. We could make tuplesort_leader_wait() totally optional, and just use the barrier within nbtsort.c for the wait (more on that later). > In particular, for parallel query, there is absolutely zero guarantee > that every worker reaches every plan node. For a parallel utility > command, things seem a little better: we can assume that workers are > started only for one particular purpose.
But even that might not be > true in the future. I expect workers that are reported launched to show up eventually, or report failure. They don't strictly have to do any work beyond just showing up (finding no tuples, reaching tuplesort_performsort(), then finally reaching tuplesort_end()). The spool2 issue I describe above shows why this is. They own the state (tuplesort tuples) that they consume, and may possibly have 2 or more tuplesorts. If they cannot do the bare minimum of checking in with us, then we're in big trouble, because that's indistinguishable from their having actually sorted some tuples that the leader ultimately needs to consume, without our knowing it. It wouldn't be impossible to use barriers for everything. That just seems to be incompatible with tuplesorts being composable. Long ago, nbtsort.c actually did the sorting, too. If that were still true, then it would be rather a lot more like parallel hashjoin, I think. You could then just have one barrier for one state machine (with one or two spools). It seems clear that we should avoid teaching tuplesort.c about nbtsort.c. > But, hang on a minute. Why do the workers need to wait for the leader > anyway? Can't they just exit once they're done sorting? I think the > original reason why this ended up in the patch is that we needed the > leader to assume ownership of the tapes to avoid having the tapes get > blown away when the worker exits. But, IIUC, with sharedfileset.c, > that problem no longer exists. You're right. This is why we could make calling tuplesort_leader_wait() optional. We only need one condition variable in tuplesort.c. Which makes me even less inclined to make the remaining workersFinishedCv condition variable into a barrier, since it's not at all barrier-like. After all, workers don't care about each other's progress, or where the leader is. The leader needs to wait until all known-launched participants report having finished, which seems like a typical reason to use a condition variable. That doesn't seem phase-like at all. As for workers, they don't have phases ("done" isn't a phase for them, because as you say, there is no need for them to wait until the leader says they can go with the shared fileset stuff -- that's the leader's problem alone.) I guess the fact that tuplesort_leader_wait() could be optional means that it could be removed, which means that we could in fact throw out the last condition variable within tuplesort.c, and fully rely on using a barrier for everything within nbtsort.c. However, tuplesort_leader_wait() seems kind of like something that we should have on general principle. And, more importantly, it would be tricky to use a barrier even for this, because we still have that baked-in assumption that nworkers_launched is the single source of truth about the number of participants. -- Peter Geoghegan
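Concretely, the hand-off being described comes out looking roughly like this. This is a sketch only: it reuses the workersFinishedCv/nparticipantsdone names from this discussion, but the struct layout is invented and error handling is elided:

#include "storage/condition_variable.h"
#include "storage/spin.h"

typedef struct SortShared
{
	slock_t		mutex;
	int			nparticipantsdone;	/* workers finished sorting */
	ConditionVariable workersFinishedCv;
} SortShared;

/* Worker, once its tuplesort_performsort() has returned: */
SpinLockAcquire(&shared->mutex);
shared->nparticipantsdone++;
SpinLockRelease(&shared->mutex);
ConditionVariableSignal(&shared->workersFinishedCv);

/* Leader, waiting for every launched worker to report in: */
for (;;)
{
	int			ndone;

	SpinLockAcquire(&shared->mutex);
	ndone = shared->nparticipantsdone;
	SpinLockRelease(&shared->mutex);

	if (ndone == pcxt->nworkers_launched)
		break;
	ConditionVariableSleep(&shared->workersFinishedCv, 0 /* wait event elided */);
}
ConditionVariableCancelSleep();

Note the baked-in assumption: the leader's loop terminates only if all nworkers_launched workers eventually increment the counter (or an ERROR propagates), which is exactly why the reliability of nworkers_launched matters so much here.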
On Thu, Jan 18, 2018 at 2:05 PM, Peter Geoghegan <pg@bowt.ie> wrote: > Review of your patch: > > * SerializedReindexState could use some comments. At least a one liner > stating its basic purpose. Added a comment. > * The "System index reindexing support" comment block could do with a > passing acknowledgement of the fact that this is serialized for > parallel workers. Done. > * Maybe the "Serialize reindex state" comment within > InitializeParallelDSM() should instead say something like "Serialize > indexes-pending-reindex state". That would require corresponding changes in a bunch of other places, possibly including the function names. I think it's better to keep the function names shorter and the comments matching the function names, so I did not make this change. > Other than that, looks good to me. It's a simple patch with a clear purpose. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 19, 2018 at 4:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Other than that, looks good to me. It's a simple patch with a clear purpose. > > Committed. Cool. Clarity on what I should do about parallel_leader_participation in the next revision would be useful at this point. You seem to either want me to remove it from consideration entirely, or to remove the code that specifically disallows a "degenerate parallel CREATE INDEX". I need a final answer on that. -- Peter Geoghegan
On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 19, 2018 at 4:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Other than that, looks good to me. It's a simple patch with a clear purpose. >> >> Committed. > > Cool. > > Clarity on what I should do about parallel_leader_participation in the > next revision would be useful at this point. You seem to either want > me to remove it from consideration entirely, or to remove the code > that specifically disallows a "degenerate parallel CREATE INDEX". I > need a final answer on that. Right. I do think that we should do one of those things, and I lean towards removing it entirely, but I'm not entirely sure. Rather than making an executive decision immediately, I'd like to wait a few days to give others a chance to comment. I am hoping that we might get some other opinions, especially from Thomas who implemented parallel_leader_participation, or maybe Amit who has been reviewing recently, or anyone else who is paying attention to this thread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jan 20, 2018 at 6:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> Clarity on what I should do about parallel_leader_participation in the >> next revision would be useful at this point. You seem to either want >> me to remove it from consideration entirely, or to remove the code >> that specifically disallows a "degenerate parallel CREATE INDEX". I >> need a final answer on that. > > Right. I do think that we should do one of those things, and I lean > towards removing it entirely, but I'm not entirely sure. Rather > than making an executive decision immediately, I'd like to wait a few > days to give others a chance to comment. I am hoping that we might get > some other opinions, especially from Thomas who implemented > parallel_leader_participation, or maybe Amit who has been reviewing > recently, or anyone else who is paying attention to this thread. Well, I see parallel_leader_participation as having these reasons to exist: 1. Gather could in rare circumstances not run the plan in the leader. This can hide bugs. It's good to be able to force that behaviour for testing. 2. Plans that tie up the leader process for a long time cause the tuple queues to block, which reduces parallelism. I speculate that some people might want to turn that off in production, but at the very least it seems useful for certain kinds of performance testing to be able to remove this complication from the picture. 3. The planner's estimations of parallel leader contribution are somewhat bogus, especially if the startup cost is high. It's useful to be able to remove that problem from the picture sometimes, at least for testing and development work. Parallel CREATE INDEX doesn't have any of those problems. The only reason I can see for it to respect parallel_leader_participation = off is for consistency with Gather. If someone decides to run their cluster with that setting, then it's slightly odd if CREATE INDEX scans and sorts with one extra process, but it doesn't seem like a big deal. I vote for removing the GUC from consideration for now (ie always use the leader), and revisiting the question again later when we have more experience or if the parallel degree logic becomes more sophisticated in future. -- Thomas Munro http://www.enterprisedb.com
On Thu, Jan 18, 2018 at 5:53 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I guess the fact that tuplesort_leader_wait() could be optional means > that it could be removed, which means that we could in fact throw out > the last condition variable within tuplesort.c, and fully rely on > using a barrier for everything within nbtsort.c. However, > tuplesort_leader_wait() seems kind of like something that we should > have on general principle. And, more importantly, it would be tricky > to use a barrier even for this, because we still have that baked-in > assumption that nworkers_launched is the single source of truth about > the number of participants. On third thought, tuplesort_leader_wait() should be removed entirely, and tuplesort.c should get entirely out of the IPC business (it should do the bare minimum of recording/reading a little state in shared memory, while knowing nothing about condition variables, barriers, or anything declared in parallel.h). Thinking about dealing with 2 spools at once clinched it for me -- calling tuplesort_leader_wait() for both underlying Tuplesortstates was silly, especially because there is still a need for an nbtsort.c-specific wait for workers to fill in ambuild stats. When I said "tuplesort_leader_wait() seems kind of like something that we should have on general principle", I was wrong. It's normal for parallel workers to have all kinds of overlapping responsibilities, and tuplesort_leader_wait() was doing something that I now imagine isn't desirable to most callers. They can easily provide something equivalent at a higher level. Besides, they'll very likely be forced to anyway, due to some high level, caller-specific need -- which is exactly what we see within nbtsort.c. Attached patch details: * The patch synchronizes processes using the approach just described. Note that this allowed me to remove several #include statements within tuplesort.c. * The patch uses only a single condition variable for a single wait within nbtsort.c, for the leader. No barriers are used at all (and, as I said, tuplesort.c doesn't use condition variables anymore). Since things are now very simple, I can't imagine anyone still arguing for the use of barriers. Note that I'm still relying on nworkers_launched as the single source of truth on the number of participants that can be expected to eventually show up (even if they end up doing zero real work). This should be fine, because I think that it will end up being formally guaranteed to be reliable by the work currently underway from Robert and Amit. But even if I'm wrong about things going that way, and it turns out that the leader needs to decide how many putative launched workers don't "get lost" due to fork() failure (the approach which Amit apparently advocates), then there *still* isn't much that needs to change. Ultimately, the leader needs to have the exact number of workers that participated, because that's fundamental to the tuplesort approach to parallel sort. If necessary, the leader can just figure it out in whatever way it likes at one central point within nbtsort.c, before the leader calls its main spool's tuplesort_begin_index_btree() -- that can happen fairly late in the process. Actually doing that (and not just using nworkers_launched) seems questionable to me, because it would be almost the first thing that the leader would do after starting parallel mode -- why not just have the parallel infrastructure do it for us, and for everyone else?
If the new tuplesort infrastructure is used in the executor at some future date, then the leader will still need to figure out the number of workers that reached tuplesort_begin* some other way. This shouldn't be surprising to anyone -- tuplesort.h is very clear on this point. * I revised the tuplesort.h contract to account for the developments already described (mostly that I've removed tuplesort_leader_wait()). * The patch makes the IPC wait event CREATE INDEX specific, since tuplesort no longer does any waits of its own -- it's now called ParallelCreateIndexScan. The patch also removes the second wait event entirely (the one that we called ParallelSortTapeHandover). * We now support index builds on catalogs. I rebased on top of Robert's recent "REINDEX state in parallel workers" commit, 29d58fd3. Note that there was a bug here in error paths that caused Robert's "can't happen" error to be raised (the PG_CATCH() block call to ResetReindexProcessing()). I fixed this in passing, by simply removing that one "can't happen" error. Note that ResetReindexProcessing() is only called directly within reindex_index()/IndexCheckExclusion(). This made the idea of preserving the check in a diminished form (#include'ing parallel.h within index.c, in order to check if we're a parallel worker as a condition of raising that "can't happen" error) seem unnecessary. * The patch does not alter anything about parallel_leader_participation, except the alterations that Robert requested to the docs (he requested these alterations on the assumption that we won't end up doing anything special with parallel_leader_participation). I am waiting for a final decision on what is to be done about parallel_leader_participation, but for now I've changed nothing. * I removed BufFileView(). I also renamed BufFileViewAppend() to BufFileAppend(). * I performed some other minor tweaks, including some requested by Robert in his most recent round of review. Thanks -- Peter Geoghegan
On Sat, Jan 20, 2018 at 2:57 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sat, Jan 20, 2018 at 6:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan <pg@bowt.ie> wrote: >>> Clarity on what I should do about parallel_leader_participation in the >>> next revision would be useful at this point. You seem to either want >>> me to remove it from consideration entirely, or to remove the code >>> that specifically disallows a "degenerate parallel CREATE INDEX". I >>> need a final answer on that. >> >> Right. I do think that we should do one of those things, and I lean >> towards removing it entirely, but I'm not entirely sure. Rather >> than making an executive decision immediately, I'd like to wait a few >> days to give others a chance to comment. I am hoping that we might get >> some other opinions, especially from Thomas who implemented >> parallel_leader_participation, or maybe Amit who has been reviewing >> recently, or anyone else who is paying attention to this thread. > > Well, I see parallel_leader_participation as having these reasons to exist: > > 1. Gather could in rare circumstances not run the plan in the leader. > This can hide bugs. It's good to be able to force that behaviour for > testing. > Or the reverse is also possible, which means the workers won't get a chance to run the plan, in which case we can use parallel_leader_participation = off to test worker behavior. As I said before, I see only that as the reason to keep parallel_leader_participation in this patch. If we decide to go that way, then I think we should remove the code that specifically disallows a "degenerate parallel CREATE INDEX", as that seems confusing. If we go this way, then I think we should use the wording suggested by Robert in one of his emails [1] to describe the usage of parallel_leader_participation. BTW, is there any other way for "parallel create index" to force that the work is done by workers? I am insisting on having something which can test the code path in workers because we have found quite a few bugs using that idea. [1] - https://www.postgresql.org/message-id/CA%2BTgmoYN-YQU9JsGQcqFLovZ-C%2BXgp1_xhJQad%3DcunGG-_p5gg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 19, 2018 at 6:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Or the reverse is also possible, which means the workers won't get a > chance to run the plan, in which case we can use parallel_leader_participation > = off to test worker behavior. As I said before, I see only that as > the reason to keep parallel_leader_participation in this patch. If we > decide to go that way, then I think we should remove the code that > specifically disallows a "degenerate parallel CREATE INDEX", as that > seems confusing. If we go this way, then I think we should use > the wording suggested by Robert in one of his emails [1] to describe > the usage of parallel_leader_participation. I agree that parallel_leader_participation is only useful for testing in the context of parallel CREATE INDEX. My concern with allowing a "degenerate parallel CREATE INDEX" to go ahead is that parallel_leader_participation generally isn't just intended for testing by hackers (if it was, then I wouldn't care). But I'm now more than willing to let this go. > BTW, is there any other way for "parallel create index" to force that > the work is done by workers? I am insisting on having something which > can test the code path in workers because we have found quite a few > bugs using that idea. I agree that this is essential (more so than supporting parallel_leader_participation). You can use the parallel_workers table storage parameter for this. When the storage param has been set, we don't care about the amount of memory available to each worker. You can stress-test the implementation as needed. (The storage param does care about max_parallel_maintenance_workers, but you can set that as high as you like.) -- Peter Geoghegan
On Sat, Jan 20, 2018 at 8:33 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 19, 2018 at 6:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > >> BTW, is there any other way for "parallel create index" to force that >> the work is done by workers? I am insisting on having something which >> can test the code path in workers because we have found quite a few >> bugs using that idea. > > I agree that this is essential (more so than supporting > parallel_leader_participation). You can use the parallel_workers table > storage parameter for this. When the storage param has been set, we > don't care about the amount of memory available to each worker. You > can stress-test the implementation as needed. (The storage param does > care about max_parallel_maintenance_workers, but you can set that as > high as you like.) > Right, but I think using parallel_leader_participation, you can do it reliably and probably write some regression tests which can complete in a predictable time. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 19, 2018 at 8:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Right, but I think using parallel_leader_participation, you can do it > reliably and probably write some regression tests which can complete > in a predictable time. Do what reliably? Guarantee that the leader will not participate as a worker, but that workers will be used? If so, yes, you can get that. The only issue is that you may not be able to launch parallel workers due to hitting a limit like max_parallel_workers, in which case you'll get a serial index build despite everything. Nothing we can do about that, though. -- Peter Geoghegan
On Sat, Jan 20, 2018 at 10:20 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 19, 2018 at 8:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Right, but I think using parallel_leader_participation, you can do it >> reliably and probably write some regression tests which can complete >> in a predictable time. > > Do what reliably? Guarantee that the leader will not participate as a > worker, but that workers will be used? If so, yes, you can get that. > Yes, that's what I mean. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 20, 2018 at 7:03 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jan 18, 2018 at 5:53 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > Attached patch details: > > * The patch synchronizes processes using the approach just described. > Note that this allowed me to remove several #include statements within > tuplesort.c. > > * The patch uses only a single condition variable for a single wait > within nbtsort.c, for the leader. No barriers are used at all (and, as > I said, tuplesort.c doesn't use condition variables anymore). Since > things are now very simple, I can't imagine anyone still arguing for > the use of barriers. > > Note that I'm still relying on nworkers_launched as the single source > of truth on the number of participants that can be expected to > eventually show up (even if they end up doing zero real work). This > should be fine, because I think that it will end up being formally > guaranteed to be reliable by the work currently underway from Robert > and Amit. But even if I'm wrong about things going that way, and it > turns out that the leader needs to decide how many putative launched > workers don't "get lost" due to fork() failure (the approach which > Amit apparently advocates), then there *still* isn't much that needs > to change. > > Ultimately, the leader needs to have the exact number of workers that > participated, because that's fundamental to the tuplesort approach to > parallel sort. > I think I can see why this patch needs that. Is it mainly for the work you are doing in _bt_leader_heapscan where you are waiting for all the workers to be finished? > If necessary, the leader can just figure it out in > whatever way it likes at one central point within nbtsort.c, before > the leader calls its main spool's tuplesort_begin_index_btree() -- > that can happen fairly late in the process. Actually doing that (and > not just using nworkers_launched) seems questionable to me, because it > would be almost the first thing that the leader would do after > starting parallel mode -- why not just have the parallel > infrastructure do it for us, and for everyone else? > I think till now we haven't had any such requirement, but if it is a must for this patch, then I don't think it is tough to do that. We need to write an API WaitForParallelWorkerToAttach() and then call it for each launched worker, or maybe WaitForParallelWorkersToAttach(), which can wait for all workers to attach and report how many have successfully attached. It will have the functionality of WaitForBackgroundWorkerStartup, and additionally it needs to check if the worker is attached to the error queue. We already have a similar API (WaitForReplicationWorkerAttach) for logical replication workers as well. Note that it might have a slight impact on performance, because with this you need to wait for the workers to start up before doing any actual work, but I don't think it should be noticeable for large operations, especially operations like parallel create index. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
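As a strawman, the API being sketched might come out something like this; it is purely hypothetical, since no such function exists yet, and both the signature and the semantics are guesses:

/*
 * Hypothetical: block until every worker registered in pcxt has either
 * attached to its error queue or failed to start, and report how many
 * actually attached.  Modeled loosely on WaitForBackgroundWorkerStartup()
 * and WaitForReplicationWorkerAttach().
 */
extern int	WaitForParallelWorkersToAttach(ParallelContext *pcxt);

/* Hypothetical caller, e.g. somewhere in nbtsort.c: */
LaunchParallelWorkers(pcxt);
nparticipants = WaitForParallelWorkersToAttach(pcxt);
/* nparticipants can now safely be treated as the true party size,
 * unlike raw pcxt->nworkers_launched in the fork()-failure scenarios
 * discussed above */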
On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think I can see why this patch needs that. Is it mainly for the > work you are doing in _bt_leader_heapscan where you are waiting for > all the workers to be finished? Yes, though it's also needed for the leader tuplesort. It needs to be able to discover worker runs by looking for temp files named 0 through to $NWORKERS - 1. The problem with seeing who shows up after a period of time, and having the leader arbitrarily determine that to be the total number of participants (while blocking further participants from joining) is that I don't know *how long to wait*. This would probably work okay for parallel CREATE INDEX, when the leader participates as a worker, because you can check only when the leader is finished acting as a worker. It stands to reason that that's enough time for worker processes to at least show up, and be seen to show up. We can use the duration of the leader's participation as a worker as a natural way to decide how long to wait. But what about when the leader doesn't participate as a worker, for whatever reason? Other uses for parallel tuplesort might typically have much less leader participation as compared to parallel CREATE INDEX. In short, ISTM that seeing who shows up is a bad strategy for parallel tuplesort. > I think till now we haven't had any such requirement, but if it is a must > for this patch, then I don't think it is tough to do that. We need to > write an API WaitForParallelWorkerToAttach() and then call it for each > launched worker, or maybe WaitForParallelWorkersToAttach(), which can > wait for all workers to attach and report how many have successfully > attached. It will have the functionality of > WaitForBackgroundWorkerStartup, and additionally it needs to check if > the worker is attached to the error queue. We already have a similar > API (WaitForReplicationWorkerAttach) for logical replication workers > as well. Note that it might have a slight impact on performance, > because with this you need to wait for the workers to start up before > doing any actual work, but I don't think it should be noticeable for > large operations, especially operations like parallel create index. Actually, though it doesn't really look like it from the way things are structured within nbtsort.c, I don't need to wait for workers to start up (call the WaitForParallelWorkerToAttach() function you sketched) before doing any real work within the leader. The leader can participate as a worker, and only do this check afterwards. That will work because the leader Tuplesortstate has yet to do any real work. Nothing stops me from adding a new function to tuplesort, for the leader, that lets the leader say: "New plan -- you should now expect this many participants" (leader takes this reliable number from eventual call to WaitForParallelWorkerToAttach()). I admit that I had no idea that there is this issue with nworkers_launched until very recently. But then, that field has absolutely no comments. -- Peter Geoghegan
On Sun, Jan 21, 2018 at 1:39 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Actually, though it doesn't really look like it from the way things > are structured within nbtsort.c, I don't need to wait for workers to > start up (call the WaitForParallelWorkerToAttach() function you > sketched) before doing any real work within the leader. The leader can > participate as a worker, and only do this check afterwards. That will > work because the leader Tuplesortstate has yet to do any real work. > Nothing stops me from adding a new function to tuplesort, for the > leader, that lets the leader say: "New plan -- you should now expect > this many participants" (leader takes this reliable number from > eventual call to WaitForParallelWorkerToAttach()). > > I admit that I had no idea that there is this issue with > nworkers_launched until very recently. But then, that field has > absolutely no comments. > It would have been better if there were some comments besides that field, but I think it has been covered at another place in the code. See comments in LaunchParallelWorkers(). /* * Start workers. * * The caller must be able to tolerate ending up with fewer workers than * expected, so there is no need to throw an error here if registration * fails. It wouldn't help much anyway, because registering the worker in * no way guarantees that it will start up and initialize successfully. */ -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 20, 2018 at 8:38 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > It would have been better if there were some comments besides that > field, but I think it has been covered at another place in the code. > See comments in LaunchParallelWorkers(). > > /* > * Start workers. > * > * The caller must be able to tolerate ending up with fewer workers than > * expected, so there is no need to throw an error here if registration > * fails. It wouldn't help much anyway, because registering the worker in > * no way guarantees that it will start up and initialize successfully. > */ Why is this okay for Gather nodes, though? nodeGather.c looks at pcxt->nworkers_launched during initialization, and appears to at least trust it to indicate that more than zero actually-launched workers will also show up when "nworkers_launched > 0". This trust seems critical when parallel_leader_participation is off, because "node->nreaders == 0" overrides the parallel_leader_participation GUC's setting (note that node->nreaders comes directly from pcxt->nworkers_launched). If zero workers show up, and parallel_leader_participation is off, but pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather never make forward progress? Parallel CREATE INDEX does go a bit further. It assumes that nworkers_launched *exactly* indicates the number of workers that successfully underwent parallel initialization, and therefore can be expected to show up. Is there actually a meaningful difference between the way nworkers_launched is depended upon in each case, though? -- Peter Geoghegan
On Mon, Jan 22, 2018 at 12:50 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Jan 20, 2018 at 8:38 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> It would have been better if there were some comments besides that >> field, but I think it has been covered at another place in the code. >> See comments in LaunchParallelWorkers(). >> >> /* >> * Start workers. >> * >> * The caller must be able to tolerate ending up with fewer workers than >> * expected, so there is no need to throw an error here if registration >> * fails. It wouldn't help much anyway, because registering the worker in >> * no way guarantees that it will start up and initialize successfully. >> */ > > Why is this okay for Gather nodes, though? nodeGather.c looks at > pcxt->nworkers_launched during initialization, and appears to at least > trust it to indicate that more than zero actually-launched workers > will also show up when "nworkers_launched > 0". This trust seems critical > when parallel_leader_participation is off, because "node->nreaders == > 0" overrides the parallel_leader_participation GUC's setting (note > that node->nreaders comes directly from pcxt->nworkers_launched). If > zero workers show up, and parallel_leader_participation is off, but > pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather > never make forward progress? > Ideally, that situation should be detected and we should throw an error, but that doesn't happen today. However, it will be handled with Robert's patch on the other thread for CF entry [1]. [1] - https://commitfest.postgresql.org/16/1341/ -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sun, Jan 21, 2018 at 6:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Why is this okay for Gather nodes, though? nodeGather.c looks at >> pcxt->nworkers_launched during initialization, and appears to at least >> trust it to indicate that more than zero actually-launched workers >> will also show up when "nworkers_launched > 0". This trust seems critical >> when parallel_leader_participation is off, because "node->nreaders == >> 0" overrides the parallel_leader_participation GUC's setting (note >> that node->nreaders comes directly from pcxt->nworkers_launched). If >> zero workers show up, and parallel_leader_participation is off, but >> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather >> never make forward progress? > > Ideally, that situation should be detected and we should throw an > error, but that doesn't happen today. However, it will be handled > with Robert's patch on the other thread for CF entry [1]. I knew that, but I was confused by your sketch of the WaitForParallelWorkerToAttach() API [1]. Specifically, your suggestion that the problem was unique to nbtsort.c, or was at least something that nbtsort.c had to take a special interest in. It now appears more like a general problem with a general solution, and likely one that won't need *any* changes to code in places like nodeGather.c (or nbtsort.c, in the case of my patch). I guess that you meant that parallel CREATE INDEX is the first thing to care about the *precise* number of nworkers_launched -- that is kind of a new thing. That doesn't seem like it makes any practical difference to us, though. I don't see why nbtsort.c should take a special interest in this problem, for example by calling WaitForParallelWorkerToAttach() itself. I may have missed something, but right now ISTM that it would be risky to make the API anything other than what both nodeGather.c and nbtsort.c already expect (that they'll either have nworkers_launched workers show up, or be able to propagate an error). [1] https://postgr.es/m/CAA4eK1KzvXTCFF8inhcEviUPxp4yWCS3rZuwjfqMttf75x2rvA@mail.gmail.com -- Peter Geoghegan
On Mon, Jan 22, 2018 at 10:36 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Jan 21, 2018 at 6:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Why is this okay for Gather nodes, though? nodeGather.c looks at >>> pcxt->nworkers_launched during initialization, and appears to at least >>> trust it to indicate that more than zero actually-launched workers >>> will also show up when "nworkers_launched > 0". This trust seems critical >>> when parallel_leader_participation is off, because "node->nreaders == >>> 0" overrides the parallel_leader_participation GUC's setting (note >>> that node->nreaders comes directly from pcxt->nworkers_launched). If >>> zero workers show up, and parallel_leader_participation is off, but >>> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather >>> never make forward progress? >> >> Ideally, that situation should be detected and we should throw an >> error, but that doesn't happen today. However, it will be handled >> with Robert's patch on the other thread for CF entry [1]. > > I knew that, but I was confused by your sketch of the > WaitForParallelWorkerToAttach() API [1]. Specifically, your suggestion > that the problem was unique to nbtsort.c, or was at least something > that nbtsort.c had to take a special interest in. It now appears more > like a general problem with a general solution, and likely one that > won't need *any* changes to code in places like nodeGather.c (or > nbtsort.c, in the case of my patch). > > I guess that you meant that parallel CREATE INDEX is the first thing > to care about the *precise* number of nworkers_launched -- that is > kind of a new thing. That doesn't seem like it makes any practical > difference to us, though. I don't see why nbtsort.c should take a > special interest in this problem, for example by calling > WaitForParallelWorkerToAttach() itself. I may have missed something, > but right now ISTM that it would be risky to make the API anything > other than what both nodeGather.c and nbtsort.c already expect (that > they'll either have nworkers_launched workers show up, or be able to > propagate an error). > The difference is that nodeGather.c doesn't have any logic like the one you have in _bt_leader_heapscan where the patch waits for each worker to increment nparticipantsdone. For Gather node, we do such a thing (wait for all workers to finish) by calling WaitForParallelWorkersToFinish which will have the capability after Robert's patch to detect if any worker is exited abnormally (fork failure or failed before attaching to the error queue). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 22, 2018 at 3:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > The difference is that nodeGather.c doesn't have any logic like the > one you have in _bt_leader_heapscan where the patch waits for each > worker to increment nparticipantsdone. For Gather node, we do such a > thing (wait for all workers to finish) by calling > WaitForParallelWorkersToFinish which will have the capability after > Robert's patch to detect if any worker is exited abnormally (fork > failure or failed before attaching to the error queue). FWIW, I don't think that that's really much of a difference. ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is similar to how _bt_end_parallel() calls WaitForParallelWorkersToFinish() in the patch. The _bt_leader_heapscan() condition variable wait for workers that you refer to is quite a bit like how gather_readnext() behaves. It generally checks to make sure that all tuple queues are done. gather_readnext() can wait for developments using WaitLatch(), to make sure every tuple queue is visited, with all output reliably consumed. This doesn't look all that similar to _bt_leader_heapscan(), I suppose, but I think that that's only because it's normal for all output to become available all at once for nbtsort.c workers. The startup cost is close to or actually the same as the total cost, as it *always* is for sort nodes. -- Peter Geoghegan
On Thu, Jan 18, 2018 at 9:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: > But, hang on a minute. Why do the workers need to wait for the leader > anyway? Can't they just exit once they're done sorting? I think the > original reason why this ended up in the patch is that we needed the > leader to assume ownership of the tapes to avoid having the tapes get > blown away when the worker exits. But, IIUC, with sharedfileset.c, > that problem no longer exists. The tapes are jointly owned by all of > the cooperating backends and the last one to detach from it will > remove them. So, if the worker sorts, advertises that it's done in > shared memory, and exits, then nothing should get removed and the > leader can adopt the tapes whenever it gets around to it. BTW, I want to point out that using the shared fileset infrastructure is only a very small impediment to adding randomAccess support. If we really wanted to support randomAccess for the leader's tapeset, while recycling blocks from worker BufFiles, it looks like all we'd have to do is change PathNameOpenTemporaryFile() to open files O_RDWR, rather than O_RDONLY (shared fileset BufFiles that are opened after export always have O_RDONLY segments -- we'd also have to change some assertions, as well as some comments). Overall, this approach looks straightforward, and isn't something that I can find an issue with after an hour or so of manual testing. Now, I'm not actually suggesting we go that way. As you know, randomAccess isn't used by CREATE INDEX, and randomAccess may never be needed for any parallel sort operation. More importantly, Thomas made PathNameOpenTemporaryFile() use O_RDONLY for a reason, and I don't want to trade one special case (randomAccess disallowed for parallel tuplesort leader tapeset) in exchange for another one (the logtape.c calls to BufFileOpenShared() ask for read-write BufFiles, not read-only BufFiles). I'm pointing this out because this is something that should increase confidence in the changes I've proposed to logtape.c. The fact that randomAccess support *would* be straightforward is a sign that I haven't accidentally introduced some other assumption, or special case. -- Peter Geoghegan
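In code terms, the experiment described amounts to roughly a one-flag change (paraphrased from memory; the real fd.c code may differ in detail):

/* as things stand: segments of a shared BufFile opened after export
 * are read-only to other backends */
file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);

/* the randomAccess experiment: open read-write instead, so that the
 * leader's unified tapeset can recycle blocks from worker tapes */
file = PathNameOpenFile(path, O_RDWR | PG_BINARY);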
On Tue, Jan 23, 2018 at 1:45 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jan 22, 2018 at 3:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> The difference is that nodeGather.c doesn't have any logic like the >> one you have in _bt_leader_heapscan where the patch waits for each >> worker to increment nparticipantsdone. For Gather node, we do such a >> thing (wait for all workers to finish) by calling >> WaitForParallelWorkersToFinish which will have the capability after >> Robert's patch to detect if any worker is exited abnormally (fork >> failure or failed before attaching to the error queue). > > FWIW, I don't think that that's really much of a difference. > > ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is > similar to how _bt_end_parallel() calls > WaitForParallelWorkersToFinish() in the patch. The > _bt_leader_heapscan() condition variable wait for workers that you > refer to is quite a bit like how gather_readnext() behaves. It > generally checks to make sure that all tuple queues are done. > gather_readnext() can wait for developments using WaitLatch(), to make > sure every tuple queue is visited, with all output reliably consumed. > The difference lies in the fact that in gather_readnext, we use tuple queue mechanism which has the capability to detect that the workers are stopped/exited whereas _bt_leader_heapscan doesn't have any such capability, so I think it will loop forever. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 22, 2018 at 6:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> FWIW, I don't think that that's really much of a difference. >> >> ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is >> similar to how _bt_end_parallel() calls >> WaitForParallelWorkersToFinish() in the patch. The >> _bt_leader_heapscan() condition variable wait for workers that you >> refer to is quite a bit like how gather_readnext() behaves. It >> generally checks to make sure that all tuple queues are done. >> gather_readnext() can wait for developments using WaitLatch(), to make >> sure every tuple queue is visited, with all output reliably consumed. >> > > The difference lies in the fact that in gather_readnext, we use tuple > queue mechanism which has the capability to detect that the workers > are stopped/exited whereas _bt_leader_heapscan doesn't have any such > capability, so I think it will loop forever. _bt_leader_heapscan() can detect when workers exit early, at least in the vast majority of cases. It can do this simply by processing interrupts and automatically propagating any error -- nothing special about that. It can also detect when workers have finished successfully, because of course, that's the main reason for its existence. What remains, exactly? I don't know that much about tuple queues, but from a quick read I guess you might be talking about shm_mq_receive() + shm_mq_wait_internal(). It's not obvious that that will work in all cases ("Note that if handle == NULL, and the process fails to attach, we'll potentially get stuck here forever"). Also, I don't see how this addresses the parallel_leader_participation issue I raised. -- Peter Geoghegan
On Tue, Jan 23, 2018 at 8:43 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jan 22, 2018 at 6:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> FWIW, I don't think that that's really much of a difference. >>> >>> ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is >>> similar to how _bt_end_parallel() calls >>> WaitForParallelWorkersToFinish() in the patch. The >>> _bt_leader_heapscan() condition variable wait for workers that you >>> refer to is quite a bit like how gather_readnext() behaves. It >>> generally checks to make sure that all tuple queues are done. >>> gather_readnext() can wait for developments using WaitLatch(), to make >>> sure every tuple queue is visited, with all output reliably consumed. >>> >> >> The difference lies in the fact that in gather_readnext, we use tuple >> queue mechanism which has the capability to detect that the workers >> are stopped/exited whereas _bt_leader_heapscan doesn't have any such >> capability, so I think it will loop forever. > > _bt_leader_heapscan() can detect when workers exit early, at least in > the vast majority of cases. It can do this simply by processing > interrupts and automatically propagating any error -- nothing special > about that. It can also detect when workers have finished > successfully, because of course, that's the main reason for its > existence. What remains, exactly? > Will it be able to detect a fork failure, or if a worker exits before attaching to the error queue? I think you can try it by forcing a fork failure in do_start_bgworker and seeing the behavior of _bt_leader_heapscan. I could have tried it and let you know the results, but the latest patch doesn't seem to apply cleanly. > I don't know that much about tuple queues, but from a quick read I > guess you might be talking about shm_mq_receive() + > shm_mq_wait_internal(). It's not obvious that that will work in all > cases ("Note that if handle == NULL, and the process fails to attach, > we'll potentially get stuck here forever"). Also, I don't see how this > addresses the parallel_leader_participation issue I raised. > I am talking about shm_mq_receive->shm_mq_counterparty_gone. In shm_mq_counterparty_gone, it can detect if the worker is gone by using GetBackgroundWorkerPid. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
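For reference, the detection being pointed at looks roughly like this (a simplified paraphrase of the shm_mq.c logic, not a verbatim excerpt; "handle" is the worker's BackgroundWorkerHandle):

pid_t		pid;
BgwHandleStatus status;

status = GetBackgroundWorkerPid(handle, &pid);
if (status == BGWH_STOPPED || status == BGWH_POSTMASTER_DIED)
	counterparty_gone = true;	/* stop waiting; the sender can never attach */

The point of contention is that _bt_leader_heapscan's condition variable loop has no equivalent check today, so after a fork() failure there is nothing to wake it up or tell it to stop waiting.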
On Mon, Jan 22, 2018 at 10:13 PM, Peter Geoghegan <pg@bowt.ie> wrote: > _bt_leader_heapscan() can detect when workers exit early, at least in > the vast majority of cases. It can do this simply by processing > interrupts and automatically propagating any error -- nothing special > about that. It can also detect when workers have finished > successfully, because of course, that's the main reason for its > existence. What remains, exactly? As Amit says, what remains is the case where fork() fails or the worker dies before it reaches the line in ParallelWorkerMain that reads shm_mq_set_sender(mq, MyProc). In those cases, no error will be signaled until you call WaitForParallelWorkersToFinish(). If you wait prior to that point for a number of workers equal to nworkers_launched, you will wait forever in those cases. I am going to repeat my previous suggestion that we use a Barrier here. Given the discussion subsequent to my original proposal, this can be a lot simpler than what I suggested originally. Each worker does BarrierAttach() before beginning to read tuples (exiting if the phase returned is non-zero) and BarrierArriveAndDetach() when it's done sorting. The leader does BarrierAttach() before launching workers and BarrierArriveAndWait() when it's done sorting. If we don't do this, we're going to have to invent some other mechanism to count the participants that actually initialize successfully, but that seems like it's just duplicating code. This proposal has some minor advantages even when no fork() failure or similar occurs. If, for example, one or more workers take a long time to start, the leader doesn't have to wait for them before writing out the index. As soon as all the workers that attached to the Barrier have arrived at the end of phase 0, the leader can build a new tape set from all of the tapes that exist at that time. It does not need to wait for the remaining workers to start up and create empty tapes. This is only a minor advantage since we probably shouldn't be doing CREATE INDEX in parallel in the first place if the index build is so short that this scenario is likely to occur, but we get it basically for free, so why not? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: > As Amit says, what remains is the case where fork() fails or the > worker dies before it reaches the line in ParallelWorkerMain that > reads shm_mq_set_sender(mq, MyProc). In those cases, no error will be > signaled until you call WaitForParallelWorkersToFinish(). If you wait > prior to that point for a number of workers equal to > nworkers_launched, you will wait forever in those cases. Another option might be to actually call WaitForParallelWorkersToFinish() in place of a condition variable or barrier, as Amit suggested at one point. > I am going to repeat my previous suggest that we use a Barrier here. > Given the discussion subsequent to my original proposal, this can be a > lot simpler than what I suggested originally. Each worker does > BarrierAttach() before beginning to read tuples (exiting if the phase > returned is non-zero) and BarrierArriveAndDetach() when it's done > sorting. The leader does BarrierAttach() before launching workers and > BarrierArriveAndWait() when it's done sorting. If we don't do this, > we're going to have to invent some other mechanism to count the > participants that actually initialize successfully, but that seems > like it's just duplicating code. I think that this closes the door to leader non-participation as anything other than a developer-only debug option, which might be fine. If parallel_leader_participation=off (or some way of getting the same behavior through a #define) is to be retained, then an artificial wait is required as a substitute for the leader's participation as a worker. -- Peter Geoghegan
On Tue, Jan 23, 2018 at 10:50 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I am going to repeat my previous suggest that we use a Barrier here. >> Given the discussion subsequent to my original proposal, this can be a >> lot simpler than what I suggested originally. Each worker does >> BarrierAttach() before beginning to read tuples (exiting if the phase >> returned is non-zero) and BarrierArriveAndDetach() when it's done >> sorting. The leader does BarrierAttach() before launching workers and >> BarrierArriveAndWait() when it's done sorting. If we don't do this, >> we're going to have to invent some other mechanism to count the >> participants that actually initialize successfully, but that seems >> like it's just duplicating code. > > I think that this closes the door to leader non-participation as > anything other than a developer-only debug option, which might be > fine. If parallel_leader_participation=off (or some way of getting the > same behavior through a #define) is to be retained, then an artificial > wait is required as a substitute for the leader's participation as a > worker. This idea of an artificial wait seems pretty grotty to me. If we made it one second, would that be okay with Valgrind builds? And when it wasn't sufficient, wouldn't we be back to waiting forever? Finally, it's still not clear to me why nodeGather.c's use of parallel_leader_participation=off doesn't suffer from similar problems [1]. [1] https://postgr.es/m/CAH2-Wz=cAMX5btE1s=aTz7CLwzpEPm_NsUhAMAo5t5=1i9VcwQ@mail.gmail.com -- Peter Geoghegan
On Tue, Jan 23, 2018 at 2:11 PM, Peter Geoghegan <pg@bowt.ie> wrote: > Finally, it's still not clear to me why nodeGather.c's use of > parallel_leader_participation=off doesn't suffer from similar problems > [1]. Thomas and I just concluded that it does. See my email on the other thread just now. I thought that I had the failure cases all nailed down here now, but I guess not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 19, 2018 at 6:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> (3) >> erm, maybe it's a problem that errors occurring in workers while the >> leader is waiting at a barrier won't unblock the leader (we don't >> detach from barriers on abort/exit) -- I'll look into this. > > I think if there's an ERROR, the general parallelism machinery is > going to arrange to kill every worker, so nothing matters in that case > unless barrier waits ignore interrupts, which I'm pretty sure they > don't. (Also: if they do, I'll hit the ceiling; that would be awful.) (After talking this through with Robert off-list). Right, the CHECK_FOR_INTERRUPTS() in ConditionVariableSleep() handles errors from parallel workers. There is no problem here. -- Thomas Munro http://www.enterprisedb.com
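For reference, a sketch of the condition-variable wait loop shape in question; SortShared and the done-counter test are illustrative stand-ins, and the locking around the shared counters is elided. The point is that ConditionVariableSleep() runs CHECK_FOR_INTERRUPTS() internally, so an ERROR sent by a worker via PROCSIG_PARALLEL_MESSAGE is re-thrown in the waiting leader rather than slept through:

#include "postgres.h"
#include "storage/condition_variable.h"

typedef struct SortShared		/* hypothetical shared state */
{
	ConditionVariable workersdonecv;
	int			nworkers;
	int			nworkersdone;
} SortShared;

static void
wait_for_workers(SortShared *shared)
{
	ConditionVariablePrepareToSleep(&shared->workersdonecv);
	while (shared->nworkersdone < shared->nworkers)
		ConditionVariableSleep(&shared->workersdonecv, 0);	/* CFI() inside */
	ConditionVariableCancelSleep();
}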
On Wed, Jan 24, 2018 at 12:20 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> As Amit says, what remains is the case where fork() fails or the >> worker dies before it reaches the line in ParallelWorkerMain that >> reads shm_mq_set_sender(mq, MyProc). In those cases, no error will be >> signaled until you call WaitForParallelWorkersToFinish(). If you wait >> prior to that point for a number of workers equal to >> nworkers_launched, you will wait forever in those cases. > > Another option might be to actually call > WaitForParallelWorkersToFinish() in place of a condition variable or > barrier, as Amit suggested at one point. > Yes, the only thing that is slightly worrying about using WaitForParallelWorkersToFinish is that the leader backend needs to wait for workers to finish rather than just finishing its sort-related work. I think there shouldn't be much difference between when the sort is done and when the workers actually finish the remaining resource cleanup. OTOH, if we are not okay with this solution and want to go with some use of barriers to solve this problem, then we can evaluate that as well, but I feel it is better if we can use the method that is used in other parallelism code to solve this problem (which is to use WaitForParallelWorkersToFinish). >> I am going to repeat my previous suggest that we use a Barrier here. >> Given the discussion subsequent to my original proposal, this can be a >> lot simpler than what I suggested originally. Each worker does >> BarrierAttach() before beginning to read tuples (exiting if the phase >> returned is non-zero) and BarrierArriveAndDetach() when it's done >> sorting. The leader does BarrierAttach() before launching workers and >> BarrierArriveAndWait() when it's done sorting. >> How does the leader detect if one of the workers does BarrierAttach and then fails (either exits or errors out) before doing BarrierArriveAndDetach? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> I am going to repeat my previous suggest that we use a Barrier here. >>> Given the discussion subsequent to my original proposal, this can be a >>> lot simpler than what I suggested originally. Each worker does >>> BarrierAttach() before beginning to read tuples (exiting if the phase >>> returned is non-zero) and BarrierArriveAndDetach() when it's done >>> sorting. The leader does BarrierAttach() before launching workers and >>> BarrierArriveAndWait() when it's done sorting. > > How does leader detect if one of the workers does BarrierAttach and > then fails (either exits or error out) before doing > BarrierArriveAndDetach? If you attach and then exit cleanly, that's a programming error and would cause anyone who runs BarrierArriveAndWait() to hang forever. If you attach and raise an error, the leader will receive an error message via CFI() and will then raise an error itself and terminate all workers during cleanup. -- Thomas Munro http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 10:36 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> I am going to repeat my previous suggest that we use a Barrier here. >>>> Given the discussion subsequent to my original proposal, this can be a >>>> lot simpler than what I suggested originally. Each worker does >>>> BarrierAttach() before beginning to read tuples (exiting if the phase >>>> returned is non-zero) and BarrierArriveAndDetach() when it's done >>>> sorting. The leader does BarrierAttach() before launching workers and >>>> BarrierArriveAndWait() when it's done sorting. >> >> How does leader detect if one of the workers does BarrierAttach and >> then fails (either exits or error out) before doing >> BarrierArriveAndDetach? > > If you attach and then exit cleanly, that's a programming error and > would cause anyone who runs BarrierArriveAndWait() to hang forever. > Right, but what if the worker dies due to something like proc_exit(1) before calling BarrierArriveAndWait? I think this is part of the problem we have solved in WaitForParallelWorkersToFinish: if the worker exits abruptly at any point for some reason, the system should not hang. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 6:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 24, 2018 at 10:36 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>>> I am going to repeat my previous suggest that we use a Barrier here. >>>>> Given the discussion subsequent to my original proposal, this can be a >>>>> lot simpler than what I suggested originally. Each worker does >>>>> BarrierAttach() before beginning to read tuples (exiting if the phase >>>>> returned is non-zero) and BarrierArriveAndDetach() when it's done >>>>> sorting. The leader does BarrierAttach() before launching workers and >>>>> BarrierArriveAndWait() when it's done sorting. >>> >>> How does leader detect if one of the workers does BarrierAttach and >>> then fails (either exits or error out) before doing >>> BarrierArriveAndDetach? >> >> If you attach and then exit cleanly, that's a programming error and >> would cause anyone who runs BarrierArriveAndWait() to hang forever. >> > > Right, but what if the worker dies due to something proc_exit(1) or > something like that before calling BarrierArriveAndWait. I think this > is part of the problem we have solved in > WaitForParallelWorkersToFinish such that if the worker exits abruptly > at any point due to some reason, the system should not hang. Actually what I said before is no longer true: after commit 2badb5af, if you exit unexpectedly then the new ParallelWorkerShutdown() exit hook delivers PROCSIG_PARALLEL_MESSAGE (apparently after detaching from the error queue) and the leader aborts when it tries to read the error queue. I just hacked Parallel Hash like this: BarrierAttach(build_barrier); + if (ParallelWorkerNumber == 0) + { + pg_usleep(1000000); + proc_exit(1); + } Now I see: postgres=# select count(*) from foox r join foox s on r.a = s.a; ERROR: lost connection to parallel worker Using a debugger I can see the leader raising that error with this stack: HandleParallelMessages at parallel.c:890 ProcessInterrupts at postgres.c:3053 ConditionVariableSleep(cv=0x000000010a62e4c8, wait_event_info=134217737) at condition_variable.c:151 BarrierArriveAndWait(barrier=0x000000010a62e4b0, wait_event_info=134217737) at barrier.c:191 MultiExecParallelHash(node=0x00007ffcd9050b10) at nodeHash.c:312 MultiExecHash(node=0x00007ffcd9050b10) at nodeHash.c:112 MultiExecProcNode(node=0x00007ffcd9050b10) at execProcnode.c:502 ExecParallelHashJoin [inlined] ExecHashJoinImpl(pstate=0x00007ffcda01baa0, parallel='\x01') at nodeHashjoin.c:291 ExecParallelHashJoin(pstate=0x00007ffcda01baa0) at nodeHashjoin.c:582 -- Thomas Munro http://www.enterprisedb.com
On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Right, but what if the worker dies due to something proc_exit(1) or > something like that before calling BarrierArriveAndWait. I think this > is part of the problem we have solved in > WaitForParallelWorkersToFinish such that if the worker exits abruptly > at any point due to some reason, the system should not hang. I have used Thomas' chaos-monkey-fork-process.patch to verify: 1. The problem of fork failure causing nbtsort.c to wait forever is a real problem. Sure enough, the coding pattern within _bt_leader_heapscan() can cause us to wait forever even with commit 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a consequence of the patch not using tuple queues (it uses the new tuplesort sharing thing instead). 2. Simply adding a single call to WaitForParallelWorkersToFinish() within _bt_leader_heapscan() before waiting on our condition variable fixes the problem -- errors are reliably propagated, and we never end up waiting forever. 3. This short term fix works just as well with parallel_leader_participation=off. At this point, my preferred solution is for someone to go implement Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems like the logical person for the job). Once that's committed, I can post a new version of the patch that uses that new infrastructure -- I'll add a call to the new function, without changing anything else. Failing that, we could actually just use WaitForParallelWorkersToFinish(). I still don't want to use a barrier, mostly because it complicates parallel_leader_participation=off, something that Amit is in agreement with [2][3]. For now, I am waiting for feedback from Robert on next steps. [1] https://postgr.es/m/CAH2-Wzm6dF=g9LYwthgCqzRc4DzBE-8Tv28Yvg0XJ8Q6e4+cBQ@mail.gmail.com [2] https://postgr.es/m/CAA4eK1LEFd28p1kw2Fst9LzgBgfMbDEq9wPh9jWFC0ye6ce62A%40mail.gmail.com [3] https://postgr.es/m/CAA4eK1+a0OF4M231vBgPr_0Ygg_BNmRGZLiB7WQDE-FYBSyrGg@mail.gmail.com -- Peter Geoghegan
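A sketch of the short-term fix from point 2 above, reusing the illustrative SortShared/wait-loop shape from earlier -- this is not the actual _bt_leader_heapscan() coding:

#include "postgres.h"
#include "access/parallel.h"
#include "storage/condition_variable.h"

static void
leader_wait(ParallelContext *pcxt, SortShared *shared)
{
	/*
	 * Blocks until every launched worker has exited, raising an ERROR if a
	 * worker died -- including the case where fork() failed and the worker
	 * never started at all.
	 */
	WaitForParallelWorkersToFinish(pcxt);

	/* By now the condition variable wait is a formality; it cannot hang */
	ConditionVariablePrepareToSleep(&shared->workersdonecv);
	while (shared->nworkersdone < shared->nworkers)
		ConditionVariableSleep(&shared->workersdonecv, 0);
	ConditionVariableCancelSleep();
}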
On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <pg@bowt.ie> wrote: > I have used Thomas' chaos-monkey-fork-process.patch to verify: > > 1. The problem of fork failure causing nbtsort.c to wait forever is a > real problem. Sure enough, the coding pattern within > _bt_leader_heapscan() can cause us to wait forever even with commit > 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a > consequence of the patch not using tuple queues (it uses the new > tuplesort sharing thing instead). Just curious: does the attached also help? > 2. Simply adding a single call to WaitForParallelWorkersToFinish() > within _bt_leader_heapscan() before waiting on our condition variable > fixes the problem -- errors are reliably propagated, and we never end > up waiting forever. That does seem like a nice, simple solution and I am not against it. The niggling thing that bothers me about it, though, is that it requires the client of parallel.c to follow a slightly complicated protocol or risk a rare obscure failure mode, and recognise the cases where that's necessary. Specifically, if you're not blocking in a shm_mq wait loop, then you must make a call to this new interface before you do any other kind of latch wait, but if you get that wrong you'll probably not notice since fork failure is rare! It seems like it'd be nicer if we could figure out a way to make it so that any latch/CFI loop would automatically be safe against fork failure. The attached (if it actually works, I dunno) is the worst way, but I wonder if there is some way to traffic just a teensy bit more information from postmaster to leader so that it could be efficient... -- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> I have used Thomas' chaos-monkey-fork-process.patch to verify: >> >> 1. The problem of fork failure causing nbtsort.c to wait forever is a >> real problem. Sure enough, the coding pattern within >> _bt_leader_heapscan() can cause us to wait forever even with commit >> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a >> consequence of the patch not using tuple queues (it uses the new >> tuplesort sharing thing instead). > > Just curious: does the attached also help? I can still reproduce the problem without the fix I described (which does work), using your patch instead. Offhand, I suspect that the way you set ParallelMessagePending may not always leave it set when it should be. >> 2. Simply adding a single call to WaitForParallelWorkersToFinish() >> within _bt_leader_heapscan() before waiting on our condition variable >> fixes the problem -- errors are reliably propagated, and we never end >> up waiting forever. > > That does seem like a nice, simple solution and I am not against it. > The niggling thing that bothers me about it, though, is that it > requires the client of parallel.c to follow a slightly complicated > protocol or risk a rare obscure failure mode, and recognise the cases > where that's necessary. Specifically, if you're not blocking in a > shm_mq wait loop, then you must make a call to this new interface > before you do any other kind of latch wait, but if you get that wrong > you'll probably not notice since fork failure is rare! It seems like > it'd be nicer if we could figure out a way to make it so that any > latch/CFI loop would automatically be safe against fork failure. It would certainly be nicer, but I don't see much risk if we add a comment next to nworkers_launched that said: Don't trust this until you've called (Amit's proposed) WaitForParallelWorkersToAttach() function, unless you're using the tuple queue infrastructure, which lets you not need to directly care about the distinction between a launched worker never starting, and a launched worker successfully completing. While I agree with what Robert said on the other thread -- "I guess that works, but it seems more like blind luck than good design. Parallel CREATE INDEX fails to be as "lucky" as Gather" -- that doesn't mean that that situation cannot be formalized. And even if it isn't formalized, then I think that that will probably be because Gather ends up doing almost the same thing. -- Peter Geoghegan
On Thu, Jan 25, 2018 at 9:28 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan <pg@bowt.ie> wrote: >>> I have used Thomas' chaos-monkey-fork-process.patch to verify: >>> >>> 1. The problem of fork failure causing nbtsort.c to wait forever is a >>> real problem. Sure enough, the coding pattern within >>> _bt_leader_heapscan() can cause us to wait forever even with commit >>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a >>> consequence of the patch not using tuple queues (it uses the new >>> tuplesort sharing thing instead). >> >> Just curious: does the attached also help? > > I can still reproduce the problem without the fix I described (which > does work), using your patch instead. > > Offhand, I suspect that the way you set ParallelMessagePending may not > always leave it set when it should be. Here's a version that works, and a minimal repro test module thing. Without 0003 applied, it hangs. With 0003 applied, it does this: postgres=# call test_fork_failure(); CALL postgres=# call test_fork_failure(); CALL postgres=# call test_fork_failure(); ERROR: lost connection to parallel worker postgres=# call test_fork_failure(); ERROR: lost connection to parallel worker I won't be surprised if 0003 is judged to be a horrendous abuse of the interrupt system, but these patches might at least be useful for understanding the problem. -- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Jan 24, 2018 at 5:31 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Here's a version that works, and a minimal repro test module thing. > Without 0003 applied, it hangs. I can confirm that this version does in fact fix the problem with parallel CREATE INDEX hanging in the event of (simulated) worker fork() failure. And, it seems to have at least one tiny advantage over the other approaches I was talking about that you didn't mention, which is that we never have to wait until the leader stops participating as a worker before an error is raised. IOW, either the whole parallel CREATE INDEX operation throws an error at an early point in the CREATE INDEX, or the CREATE INDEX completely succeeds. Obviously, the other, stated advantage is more relevant: *everyone* automatically doesn't have to worry about nworkers_launched being inaccurate this way, including code that gets away with this today only due to using a tuple queue, such as nodeGather.c, but may not always get away with it in the future. I've run out of time to assess what you've done here in any real depth. For now, I will say that this approach seems interesting to me. I'll take a closer look tomorrow. -- Peter Geoghegan
On Thu, Jan 25, 2018 at 1:24 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Right, but what if the worker dies due to something proc_exit(1) or >> something like that before calling BarrierArriveAndWait. I think this >> is part of the problem we have solved in >> WaitForParallelWorkersToFinish such that if the worker exits abruptly >> at any point due to some reason, the system should not hang. > > I have used Thomas' chaos-monkey-fork-process.patch to verify: > > 1. The problem of fork failure causing nbtsort.c to wait forever is a > real problem. Sure enough, the coding pattern within > _bt_leader_heapscan() can cause us to wait forever even with commit > 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a > consequence of the patch not using tuple queues (it uses the new > tuplesort sharing thing instead). > > 2. Simply adding a single call to WaitForParallelWorkersToFinish() > within _bt_leader_heapscan() before waiting on our condition variable > fixes the problem -- errors are reliably propagated, and we never end > up waiting forever. > > 3. This short term fix works just as well with > parallel_leader_participation=off. > > At this point, my preferred solution is for someone to go implement > Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems > like the logical person for the job). > I can implement it and share a prototype patch with you which you can use to test parallel sort stuff. I would like to highlight the difference which you will see with WaitForParallelWorkersToAttach as compare to WaitForParallelWorkersToFinish() is that the former will give you how many of nworkers_launched workers are actually launched whereas latter gives an error if any of the expected workers is not launched. I feel former is good and your proposed way of calling it after the leader is done with its work has alleviated the minor disadvantage of this API which is that we need for workers to startup. However, now I see that you and Thomas are trying to find a different way to overcome this problem differently, so not sure if I should go ahead or not. I have seen that you told you wanted to look at Thomas's proposed stuff carefully tomorrow, so I will wait for you guys to decide which way is appropriate. > Once that's committed, I can > post a new version of the patch that uses that new infrastructure -- > I'll add a call to the new function, without changing anything else. > Failing that, we could actually just use > WaitForParallelWorkersToFinish(). I still don't want to use a barrier, > mostly because it complicates parallel_leader_participation=off, > something that Amit is in agreement with [2][3]. > I think if we want we can use barrier API's to solve this problem, but I kind of have a feeling that it doesn't seem to be the most appropriate API, especially because existing API like WaitForParallelWorkersToFinish() can serve the need in a similar way. Just to conclude, following are proposed ways to solve this problem: 1. Implement a new API WaitForParallelWorkersToAttach and use that to solve this problem. Peter G. and Amit thinks, this is a good way to solve this problem. 2. Use existing API WaitForParallelWorkersToFinish to solve this problem. Peter G. feels that if API mentioned in #1 is not available, we can use this to solve the problem and I agree with that position. Thomas is not against it. 3. Use Thomas's new way to detect such failures. 
It is not clear to me at this stage if any one of us has accepted it as the way to proceed, but Thomas and Peter G. want to investigate it further. 4. Use of Barrier API to solve this problem. Robert appears to be strongly in favor of this approach. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 26, 2018 at 11:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jan 25, 2018 at 1:24 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> Right, but what if the worker dies due to something proc_exit(1) or >>> something like that before calling BarrierArriveAndWait. I think this >>> is part of the problem we have solved in >>> WaitForParallelWorkersToFinish such that if the worker exits abruptly >>> at any point due to some reason, the system should not hang. >> >> I have used Thomas' chaos-monkey-fork-process.patch to verify: >> >> 1. The problem of fork failure causing nbtsort.c to wait forever is a >> real problem. Sure enough, the coding pattern within >> _bt_leader_heapscan() can cause us to wait forever even with commit >> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a >> consequence of the patch not using tuple queues (it uses the new >> tuplesort sharing thing instead). >> >> 2. Simply adding a single call to WaitForParallelWorkersToFinish() >> within _bt_leader_heapscan() before waiting on our condition variable >> fixes the problem -- errors are reliably propagated, and we never end >> up waiting forever. >> >> 3. This short term fix works just as well with >> parallel_leader_participation=off. >> >> At this point, my preferred solution is for someone to go implement >> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems >> like the logical person for the job). >> > > I can implement it and share a prototype patch with you which you can > use to test parallel sort stuff. I would like to highlight the > difference which you will see with WaitForParallelWorkersToAttach as > compare to WaitForParallelWorkersToFinish() is that the former will > give you how many of nworkers_launched workers are actually launched > whereas latter gives an error if any of the expected workers is not > launched. I feel former is good and your proposed way of calling it > after the leader is done with its work has alleviated the minor > disadvantage of this API which is that we need for workers to startup. > /we need for workers to startup./we need to wait for workers to startup. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> At this point, my preferred solution is for someone to go implement >> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems >> like the logical person for the job). >> > > I can implement it and share a prototype patch with you which you can > use to test parallel sort stuff. That would be great. Thank you. > I would like to highlight the > difference which you will see with WaitForParallelWorkersToAttach as > compare to WaitForParallelWorkersToFinish() is that the former will > give you how many of nworkers_launched workers are actually launched > whereas latter gives an error if any of the expected workers is not > launched. I feel former is good and your proposed way of calling it > after the leader is done with its work has alleviated the minor > disadvantage of this API which is that we need for workers to startup. I'm not sure that it makes much difference, though, since in the end WaitForParallelWorkersToFinish() is called anyway, much like nodeGather.c. Have I missed something? I had imagined that WaitForParallelWorkersToAttach() would give me an error in the style of WaitForParallelWorkersToFinish(), without actually waiting for the parallel workers to finish. > However, now I see that you and Thomas are trying to find a different > way to overcome this problem differently, so not sure if I should go > ahead or not. I have seen that you told you wanted to look at > Thomas's proposed stuff carefully tomorrow, so I will wait for you > guys to decide which way is appropriate. I suspect that the overhead of Thomas' experimental approach is going to cause problems in certain cases. Cases that are hard to foresee. That patch makes HandleParallelMessages() set ParallelMessagePending artificially, pending confirmation of having launched all workers. It was an interesting experiment, but I think that your WaitForParallelWorkersToAttach() idea has a better chance of working out. >> Once that's committed, I can >> post a new version of the patch that uses that new infrastructure -- >> I'll add a call to the new function, without changing anything else. >> Failing that, we could actually just use >> WaitForParallelWorkersToFinish(). I still don't want to use a barrier, >> mostly because it complicates parallel_leader_participation=off, >> something that Amit is in agreement with [2][3]. >> > > I think if we want we can use barrier API's to solve this problem, but > I kind of have a feeling that it doesn't seem to be the most > appropriate API, especially because existing API like > WaitForParallelWorkersToFinish() can serve the need in a similar way. I can't see a way in which using a barrier can have less complexity. I think it will have quite a bit more, and I suspect that you share this feeling. > Just to conclude, following are proposed ways to solve this problem: > > 1. Implement a new API WaitForParallelWorkersToAttach and use that to > solve this problem. Peter G. and Amit thinks, this is a good way to > solve this problem. > 2. Use existing API WaitForParallelWorkersToFinish to solve this > problem. Peter G. feels that if API mentioned in #1 is not available, > we can use this to solve the problem and I agree with that position. > Thomas is not against it. > 3. Use Thomas's new way to detect such failures. It is not clear to > me at this stage if any one of us have accepted it to be the way to > proceed, but Thomas and Peter G. want to investigate it further. > 4.
Use of Barrier API to solve this problem. Robert appears to be > strongly in favor of this approach. That's a good summary. The next revision of the patch will make the leader-participates-as-worker spool/Tuplesortstate start and finish sorting before the main leader spool/Tuplesortstate is even started. I did this with the intention of making it very clear that my approach does not assume a number of participants up-front -- that is actually something we only need a final answer on at the point that the leader merges, which is logically the last possible moment. Hopefully this will reassure Robert. It is quite a small change, but leads to a slightly cleaner organization within nbtsort.c, since _bt_begin_parallel() is the only point that has to deal with leader participation. Another minor advantage is that this makes the trace_sort overheads/duration for each of the two tuplesorts within the leader non-overlapping (when the leader participates as a worker). -- Peter Geoghegan
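For clarity, a sketch of the leader-side flow under the WaitForParallelWorkersToAttach() semantics Peter describes (error out in the style of WaitForParallelWorkersToFinish(), without waiting for workers to finish). Everything other than the parallel.h calls and pcxt->nworkers_launched is an illustrative stand-in, reusing names from the earlier sketches:

static void
build_index_in_parallel(ParallelContext *pcxt, SortShared *shared)
{
	LaunchParallelWorkers(pcxt);

	do_leader_sort();			/* leader's own sort happens first */

	/*
	 * ERRORs out if any launched worker failed to attach (e.g. fork()
	 * failure), but does not wait for workers to finish.  Called after
	 * the leader's own sort, it should rarely need to block at all.
	 */
	WaitForParallelWorkersToAttach(pcxt);

	/* pcxt->nworkers_launched is now safe to wait on */
	wait_for_workers(shared);
}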
On Fri, Jan 26, 2018 at 12:00 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> At this point, my preferred solution is for someone to go implement >>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems >>> like the logical person for the job). >>> >> >> I can implement it and share a prototype patch with you which you can >> use to test parallel sort stuff. > > That would be great. Thank you. > >> I would like to highlight the >> difference which you will see with WaitForParallelWorkersToAttach as >> compare to WaitForParallelWorkersToFinish() is that the former will >> give you how many of nworkers_launched workers are actually launched >> whereas latter gives an error if any of the expected workers is not >> launched. I feel former is good and your proposed way of calling it >> after the leader is done with its work has alleviated the minor >> disadvantage of this API which is that we need for workers to startup. > > I'm not sure that it makes much difference, though, since in the end > WaitForParallelWorkersToFinish() is called anyway, much like > nodeGather.c. Have I missed something? > Nopes, you are right. I had in my mind that if we have something like what I am proposing, then we don't even need to detect failures in WaitForParallelWorkersToFinish and we can finish the work without failing. > I had imagined that WaitForParallelWorkersToAttach() would give me an > error in the style of WaitForParallelWorkersToFinish(), without > actually waiting for the parallel workers to finish. > I think that is also doable. I will give it a try and report back if I see any problem with it. However, it might take me some time as I am busy with few other things and I am planning to take two days off for some personal reasons, OTOH if it turns out to be a simple (which I expect it should be), then I will report back early. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan <pg@bowt.ie> wrote: > I had imagined that WaitForParallelWorkersToAttach() would give me an > error in the style of WaitForParallelWorkersToFinish(), without > actually waiting for the parallel workers to finish. +1. If we're going to go that route, and that seems to be the consensus, then I think an error is more appropriate than returning an updated worker count. On the question of whether this is better or worse than using barriers, I'm not entirely sure. I understand that various objections to the Barrier concept have been raised, but I'm not personally convinced by any of them. On the other hand, if we only have to call WaitForParallelWorkersToAttach after the leader finishes its own sort, then there's no latency advantage to the barrier approach. I suspect we might still end up reworking this if we add the ability for new workers to join an index build in medias res at some point in the future -- but, as Peter points out, maybe the whole algorithm would get reworked in that scenario. So, since other people like WaitForParallelWorkersToAttach, I think we can just go with that for now. I don't want to kill this patch with unnecessary nitpicking. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 26, 2018 at 10:01 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> I had imagined that WaitForParallelWorkersToAttach() would give me an >> error in the style of WaitForParallelWorkersToFinish(), without >> actually waiting for the parallel workers to finish. > > +1. If we're going to go that route, and that seems to be the > consensus, then I think an error is more appropriate than returning an > updated worker count. Great. Should I wait for Amit's WaitForParallelWorkersToAttach() patch to be posted, reviewed, and committed, or would you like to see what I came up with ("The next revision of the patch will make the leader-participates-as-worker spool/Tuplelsortstate start and finish sorting before the main leader spool/Tuplelsortstate is even started") today? -- Peter Geoghegan
On Fri, Jan 26, 2018 at 1:17 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 26, 2018 at 10:01 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan <pg@bowt.ie> wrote: >>> I had imagined that WaitForParallelWorkersToAttach() would give me an >>> error in the style of WaitForParallelWorkersToFinish(), without >>> actually waiting for the parallel workers to finish. >> >> +1. If we're going to go that route, and that seems to be the >> consensus, then I think an error is more appropriate than returning an >> updated worker count. > > Great. > > Should I wait for Amit's WaitForParallelWorkersToAttach() patch to be > posted, reviewed, and committed, or would you like to see what I came > up with ("The next revision of the patch will make the > leader-participates-as-worker spool/Tuplelsortstate start and finish > sorting before the main leader spool/Tuplelsortstate is even started") > today? I'm busy with other things, so no rush. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 26, 2018 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm busy with other things, so no rush. Got it. There is one question that I should probably get clarity on ahead of the next revision, which is: Should I rip out the code that disallows a "degenerate parallel CREATE INDEX" when parallel_leader_participation=off, or should I instead rip out any code that deals with parallel_leader_participation, and always have the leader participate as a worker? If I did the latter, then leader non-participation would live on as a #define debug option within nbtsort.c. It definitely seems like we'd want to preserve that at a minimum. -- Peter Geoghegan
On Fri, Jan 26, 2018 at 2:04 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jan 26, 2018 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I'm busy with other things, so no rush. > > Got it. > > There is one question that I should probably get clarity on ahead of > the next revision, which is: Should I rip out the code that disallows > a "degenerate parallel CREATE INDEX" when > parallel_leader_participation=off, or should I instead rip out any > code that deals with parallel_leader_participation, and always have > the leader participate as a worker? > > If I did the latter, then leader non-participation would live on as a > #define debug option within nbtsort.c. It definitely seems like we'd > want to preserve that at a minimum. Hmm, I like the idea of making it a #define instead of having it depend on parallel_leader_participation. Let's do that. If the consensus is later that it was the wrong decision, it'll be easy to change it back. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 26, 2018 at 11:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Hmm, I like the idea of making it a #define instead of having it > depend on parallel_leader_participation. Let's do that. If the > consensus is later that it was the wrong decision, it'll be easy to > change it back. WFM. -- Peter Geoghegan
On Fri, Jan 26, 2018 at 7:30 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> However, now I see that you and Thomas are trying to find a different >> way to overcome this problem differently, so not sure if I should go >> ahead or not. I have seen that you told you wanted to look at >> Thomas's proposed stuff carefully tomorrow, so I will wait for you >> guys to decide which way is appropriate. > > I suspect that the overhead of Thomas' experimental approach is going > to causes problems in certain cases. Cases that are hard to foresee. > That patch makes HandleParallelMessages() set ParallelMessagePending > artificially, pending confirmation of having launched all workers. > > It was an interesting experiment, but I think that your > WaitForParallelWorkersToAttach() idea has a better chance of working > out. Thanks for looking into this. Yeah. I think you're right that it could add a bit of overhead in some cases (ie if you receive a lot of signals that AREN'T caused by fork failure, then you'll enter HandleParallelMessage() every time unnecessarily), and it does feel a bit kludgy. The best idea I have to fix that so far is like this: (1) add a member fork_failure_count to struct BackgroundWorkerArray, (2) in do_start_bgworker() whenever fork fails, do ++BackgroundWorkerData->fork_failure_count (ie before a signal is sent to the leader), (3) in procsignal_sigusr1_handler where we normally do a bunch of CheckProcSignal(PROCSIG_XXX) stuff, if (BackgroundWorkerData->fork_failure_count != last_observed_fork_failure_count) HandleParallelMessageInterrupt(). As far as I know, as long as fork_failure_count is (say) int32 (ie not prone to tearing) then no locking is required due to the barriers implicit in the syscalls involved there. This is still slightly more pessimistic than it needs to be (the failed fork may be for someone else's ParallelContext), but only in rare cases so it would be practically as good as precise PROCSIG delivery. It's just that we aren't allowed to deliver PROCSIGs from the postmaster. We are allowed to communicate through BackgroundWorkerData, and there is a precedent for cluster-visible event counters in there already. I think you should proceed with Amit's plan. If we ever make a plan like the above work in future, it'd render that redundant by turning every CFI() into a cancellation point for fork failure, but I'm not planning to investigate further given the muted response to my scheming in this area so far. -- Thomas Munro http://www.enterprisedb.com
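Spelled out as code, the three steps sketched above might look as follows; fork_failure_count is a hypothetical new member, and the surrounding code is abbreviated:

/* (1) In struct BackgroundWorkerArray (bgworker.c): a counter the
 * postmaster bumps and any backend can read; int32-width loads and
 * stores aren't prone to tearing, so no locking is needed. */
uint32		fork_failure_count;

/* (2) In do_start_bgworker(), when fork() fails, before the leader
 * gets signalled: */
++BackgroundWorkerData->fork_failure_count;

/* (3) In procsignal_sigusr1_handler(), alongside the CheckProcSignal()
 * tests: */
static uint32 last_observed_fork_failure_count = 0;

if (BackgroundWorkerData->fork_failure_count !=
	last_observed_fork_failure_count)
{
	last_observed_fork_failure_count =
		BackgroundWorkerData->fork_failure_count;
	HandleParallelMessageInterrupt();	/* next CFI() checks for messages */
}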
On Fri, Jan 26, 2018 at 6:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Thanks for looking into this. Yeah. I think you're right that it > could add a bit of overhead in some cases (ie if you receive a lot of > signals that AREN'T caused by fork failure, then you'll enter > HandleParallelMessage() every time unnecessarily), and it does feel a > bit kludgy. The best idea I have to fix that so far is like this: (1) > add a member fork_failure_count to struct BackgroundWorkerArray, (2) > in do_start_bgworker() whenever fork fails, do > ++BackgroundWorkerData->fork_failure_count (ie before a signal is sent > to the leader), (3) in procsignal_sigusr1_handler where we normally do > a bunch of CheckProcSignal(PROCSIG_XXX) stuff, if > (BackgroundWorkerData->fork_failure_count != > last_observed_fork_failure_count) HandleParallelMessageInterrupt(). > As far as I know, as long as fork_failure_count is (say) int32 (ie not > prone to tearing) then no locking is required due to the barriers > implicit in the syscalls involved there. This is still slightly more > pessimistic than it needs to be (the failed fork may be for someone > else's ParallelContext), but only in rare cases so it would be > practically as good as precise PROCSIG delivery. It's just that we > aren't allowed to deliver PROCSIGs from the postmaster. We are > allowed to communicate through BackgroundWorkerData, and there is a > precedent for cluster-visible event counters in there already. I could sign on to that plan, but I don't think we should hold this patch up for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 26, 2018 at 12:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jan 26, 2018 at 12:00 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> At this point, my preferred solution is for someone to go implement >>>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems >>>> like the logical person for the job). >>>> >>> >>> I can implement it and share a prototype patch with you which you can >>> use to test parallel sort stuff. >> >> That would be great. Thank you. >> >>> I would like to highlight the >>> difference which you will see with WaitForParallelWorkersToAttach as >>> compare to WaitForParallelWorkersToFinish() is that the former will >>> give you how many of nworkers_launched workers are actually launched >>> whereas latter gives an error if any of the expected workers is not >>> launched. I feel former is good and your proposed way of calling it >>> after the leader is done with its work has alleviated the minor >>> disadvantage of this API which is that we need for workers to startup. >> >> I'm not sure that it makes much difference, though, since in the end >> WaitForParallelWorkersToFinish() is called anyway, much like >> nodeGather.c. Have I missed something? >> > > Nopes, you are right. I had in my mind that if we have something like > what I am proposing, then we don't even need to detect failures in > WaitForParallelWorkersToFinish and we can finish the work without > failing. > >> I had imagined that WaitForParallelWorkersToAttach() would give me an >> error in the style of WaitForParallelWorkersToFinish(), without >> actually waiting for the parallel workers to finish. >> > > I think that is also doable. I will give it a try and report back if > I see any problem with it. > I have posted the patch for the above API and posted it on a new thread [1]. Do let me know either here or on that thread if the patch suffices your need? [1] - https://www.postgresql.org/message-id/CAA4eK1%2Be2MzyouF5bg%3DOtyhDSX%2B%3DAo%3D3htN%3DT-r_6s3gCtKFiw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 27, 2018 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have posted the patch for the above API and posted it on a new > thread [1]. Do let me know either here or on that thread if the patch > suffices your need? I've responded to you over on that thread. Thanks again for helping me. I already have a revision of my patch lined up that is coded to target your new WaitForParallelWorkersToAttach() interface, plus some other changes. These include: * Make the leader's worker Tuplesortstate complete before the main leader Tuplesortstate even begins, making it very clear that nbtsort.c does not rely on knowing the number of launched workers up-front. That should make Robert a bit happier about our ability to add additional workers fairly late in the process, in a future tuplesort client that finds that to be useful. * I've added a new "parallel" argument to index_build(), which controls whether or not we even call the plan_create_index_workers() cost model. When this is false, we always do a serial build. This was added because I noticed that TRUNCATE REINDEXes the table at a time when parallelism couldn't possibly be useful, and yet it still used parallelism. Best to have the top-level caller opt in or opt out. * Polished the docs some more. * Improved commentary on randomAccess/writable leader handling within logtape.c. We could still support that, if we were willing to make shared BufFiles that are opened within another backend writable. I'm not proposing to do that, but it's nice that we could. I hesitate to post something that won't cleanly apply on the master branch's tip, but otherwise I am ready to send this new revision of the patch right away. It seems likely that Robert will commit your patch within a matter of days, once some minor issues are worked through, at which point I'll send what I have. If anyone prefers, I can post the patch immediately, and break out WaitForParallelWorkersToAttach() as the second patch in a cumulative patch set. Right now, I'm out of things to work on here. Notes on how I've stress-tested parallel CREATE INDEX: I can recommend using the amcheck heapallindexed functionality [1] from the Github version of amcheck to test this patch. You will need to modify the call to IndexBuildHeapScan() that the extension makes, to add a new NULL "scan" argument, since parallel CREATE INDEX changes the signature of IndexBuildHeapScan(). That's trivial, though. Note that parallel CREATE INDEX should produce relfiles that are physically identical to a serial CREATE INDEX, since index tuplesorts are deterministic. IOW, we use a heap TID tie-breaker within tuplesort.c for B-Tree index tuples, which assures us that varying maintenance_work_mem won't affect the final output even in a tiny, insignificant way -- using parallelism should not change anything about the exact output, either. At one point I was testing this patch by verifying not only that indexes were sane, but that they were physically identical to what a serial sort (in the master branch) produced (I only needed to mask page LSNs). Finally, yet another good way to test this patch is to verify that everything continues to work when MAX_PHYSICAL_FILESIZE is modified to be BLCKSZ (2^13 rather than 2^30). You will get many, many BufFile segments that way, which could in theory reveal bugs in rare edge cases that I haven't considered.
This strategy led to my finding a bug in v10 at one point [2], as well as bugs in earlier versions of Thomas' parallel hash join patch set. It has worked for me twice already, so it seems like a good one. It may be worth *combining* with some other stress-testing strategy. [1] https://github.com/petergeoghegan/amcheck#optional-heapallindexed-verification [2] https://www.postgresql.org/message-id/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com -- Peter Geoghegan
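For anyone reproducing that last test, the change amounts to something like the following (against the existing define in buffile.c; a stress-testing hack only, not something to commit):

--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
-#define MAX_PHYSICAL_FILESIZE	0x40000000
+#define MAX_PHYSICAL_FILESIZE	BLCKSZ	/* 2^13: one block per segment */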
On Mon, Jan 29, 2018 at 4:06 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Jan 27, 2018 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I have posted the patch for the above API and posted it on a new >> thread [1]. Do let me know either here or on that thread if the patch >> suffices your need? > > I've responded to you over on that thread. Thanks again for helping me. > > I already have a revision of my patch lined up that is coded to target > your new WaitForParallelWorkersToAttach() interface, plus some other > changes. Attached patch has these changes. -- Peter Geoghegan
Attachment
On Fri, Feb 2, 2018 at 11:16 AM, Peter Geoghegan <pg@bowt.ie> wrote: > Attached patch has these changes. And that patch you attached is also, now, committed. If you could keep an eye on the buildfarm and investigate anything that breaks, I would appreciate it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > And that patch you attached is also, now, committed. > > If you could keep an eye on the buildfarm and investigate anything > that breaks, I would appreciate it. Fantastic! I can keep an eye on it throughout the day. Thanks everyone -- Peter Geoghegan
On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: > Thanks everyone I would like to acknowledge the assistance of Corey Huinker with early testing of the patch (this took place in 2016, and much of it was not on-list). Even though he wasn't credited in the commit message, he should appear in the V11 release notes reviewer list IMV. His contribution certainly merits it. -- Peter Geoghegan
On Fri, Feb 2, 2018 at 3:23 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> Thanks everyone > > I would like to acknowledge the assistance of Corey Huinker with early > testing of the patch (this took place in 2016, and much of it was not > on-list). Even though he wasn't credited in the commit message, he > should appear in the V11 release notes reviewer list IMV. His > contribution certainly merits it. For the record, I typically construct the list of reviewers by reading over the thread and adding all the people whose names I find there in chronological order, excluding things that are clearly not review (like "Bumped to next CF.") and opinions on narrow questions that don't indicate that any code-reading or testing was done (like "+1 for calling the GUC foo_bar_baz rather than quux_bletch".) I saw that you copied Corey on the original email, but I see no posts from him on the thread, which is why he didn't get included in the commit message. While I have no problem with him being included in the release notes, I obviously can't know about activity that happens entirely off-list. If you mentioned somewhere in the 200+ messages on this topic that he should be included, I missed that, too. I think it's much harder to give credit adequately when contributions are off-list; letting everyone know what's going on is why we have a list. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas <robertmhaas@gmail.com> wrote: > For the record, I typically construct the list of reviewers by reading > over the thread and adding all the people whose names I find there in > chronological order, excluding things that are clearly not review > (like "Bumped to next CF.") and opinions on narrow questions that > don't indicate that any code-reading or testing was done (like "+1 for > calling the GUC foo_bar_baz rather than quux_bletch".) I saw that you > copied Corey on the original email, but I see no posts from him on the > thread, which is why he didn't get included in the commit message. I did credit him in my own proposed commit message. I know that it's not part of your workflow to preserve that, but I had assumed that that would at least be taken into account. Anyway, mistakes like this happen. I'm glad that we now have the reviewer credit list, so that they can be corrected afterwards. -- Peter Geoghegan
On Fri, Feb 2, 2018 at 3:35 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> For the record, I typically construct the list of reviewers by reading >> over the thread and adding all the people whose names I find there in >> chronological order, excluding things that are clearly not review >> (like "Bumped to next CF.") and opinions on narrow questions that >> don't indicate that any code-reading or testing was done (like "+1 for >> calling the GUC foo_bar_baz rather than quux_bletch".) I saw that you >> copied Corey on the original email, but I see no posts from him on the >> thread, which is why he didn't get included in the commit message. > > I did credit him in my own proposed commit message. I know that it's > not part of your workflow to preserve that, but I had assumed that > that would at least be taken into account. Ah. Sorry, I didn't look at that. I try to remember to look at proposed commit messages, but not everyone includes them, which is probably part of the reason I don't always remember to look for them. Or maybe I just have failed to adequately develop that habit... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> If you could keep an eye on the buildfarm and investigate anything >> that breaks, I would appreciate it. > I can keep an eye on it throughout the day. There is a benign Valgrind error that causes the lousyjack animal to report failure. It looks like this: ==6850== Syscall param write(buf) points to uninitialised byte(s) ==6850== at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so) ==6850== by 0x82328F: FileWrite (fd.c:2017) ==6850== by 0x8261AD: BufFileDumpBuffer (buffile.c:513) ==6850== by 0x826569: BufFileFlush (buffile.c:657) ==6850== by 0x8262FB: BufFileRead (buffile.c:561) ==6850== by 0x9F6C79: ltsReadBlock (logtape.c:273) ==6850== by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906) ==6850== by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477) ==6850== by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499) ==6850== by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823) I'll need to go and write a Valgrind suppression for this. I'll get to it later today. -- Peter Geoghegan
On 2018-02-02 13:35:59 -0800, Peter Geoghegan wrote: > On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> If you could keep an eye on the buildfarm and investigate anything > >> that breaks, I would appreciate it. > > > I can keep an eye on it throughout the day. > > There is a benign Valgrind error that causes the lousyjack animal to > report failure. It looks like this: > > ==6850== Syscall param write(buf) points to uninitialised byte(s) > ==6850== at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so) > ==6850== by 0x82328F: FileWrite (fd.c:2017) > ==6850== by 0x8261AD: BufFileDumpBuffer (buffile.c:513) > ==6850== by 0x826569: BufFileFlush (buffile.c:657) > ==6850== by 0x8262FB: BufFileRead (buffile.c:561) > ==6850== by 0x9F6C79: ltsReadBlock (logtape.c:273) > ==6850== by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906) > ==6850== by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477) > ==6850== by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499) > ==6850== by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823) Not saying you're wrong, but you should include a comment on why this is a benign warning. Presumably it's some padding memory somewhere, but it's not obvious from the above bleat. Greetings, Andres Freund
On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
> Not saying you're wrong, but you should include a comment on why this is
> a benign warning. Presumably it's some padding memory somewhere, but
> it's not obvious from the above bleat.

Sure. This looks slightly more complicated than first anticipated, but
I'll keep everyone posted.

Valgrind suppression aside, this raises another question. The stack
trace shows that the error happens during the creation of a new TOAST
table (CheckAndCreateToastTable()). I wonder if I should also pass
down a flag that makes sure that parallelism is never even attempted
from that path, to match TRUNCATE's suppression of parallel index
builds during its reindexing. It really shouldn't be a problem as
things stand, but maybe it's better to be consistent about "useless"
parallel CREATE INDEX attempts, and suppress them here too.

--
Peter Geoghegan
On Fri, Feb 2, 2018 at 4:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
>> Not saying you're wrong, but you should include a comment on why this is
>> a benign warning. Presumably it's some padding memory somewhere, but
>> it's not obvious from the above bleat.
>
> Sure. This looks slightly more complicated than first anticipated, but
> I'll keep everyone posted.

I couldn't make up my mind if it was best to prevent the uninitialized
write(), or to instead just add a suppression. I eventually decided
upon the suppression -- see attached patch. My proposed commit message
has a full explanation of the Valgrind issue, which I won't repeat
here. Go read it before reading the rest of this e-mail.

It might seem like my suppression is overly broad, or not broad
enough, since it essentially targets LogicalTapeFreeze(). I don't
think it is, though, because this can occur in two places within
LogicalTapeFreeze() -- it can occur in the place we actually saw the
issue on lousyjack, from the ltsReadBlock() call within
LogicalTapeFreeze(), as well as a second place -- when
BufFileExportShared() is called. I found that you have to tweak code
to prevent it happening in the first place before you'll see it happen
in the second place. I see no point in actually playing whack-a-mole
for a totally benign issue like this, though, which made me finally
decide upon the suppression approach.

Bear in mind that a third way of fixing this would be to allocate
logtape.c buffers using palloc0() rather than palloc() (though I don't
like that idea at all). For serial external sorts, the logtape.c
buffers are guaranteed to have been written to/initialized at least
once as part of spilling a sort to disk. Parallel external sorts don't
quite guarantee that, which is why we run into this Valgrind issue.

--
Peter Geoghegan
Attachment
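For readers who haven't written one: a Valgrind suppression is a stanza
in a .supp file -- PostgreSQL keeps its own in src/tools/valgrind.supp --
that matches an error kind plus a call-stack pattern. A minimal sketch
along the lines Peter describes might look like the following; the
stanza name is invented here, and the frames in the actual patch may
differ:

    {
        uninitialized_logtape_write_sketch
        Memcheck:Param
        write(buf)
        ...
        fun:LogicalTapeFreeze
    }

The "..." frame-level wildcard matches the intermediate fd.c/buffile.c
frames, so the stanza silences only uninitialized write()s that are
reached from LogicalTapeFreeze().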
On Fri, Feb 2, 2018 at 10:26 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> My proposed commit message
> has a full explanation of the Valgrind issue, which I won't repeat
> here. Go read it before reading the rest of this e-mail.

I'm going to paste the first two sentences of your proposed commit
message in here for the convenience of other readers, since I want to
reply to them.

# LogicalTapeFreeze() may write out its first block when it is dirty but
# not full, and then immediately read the first block back in from its
# BufFile as a BLCKSZ-width block. This can only occur in rare cases
# where next to no tuples were written out, which is only possible with
# parallel external tuplesorts.

So, if I understand correctly what you're saying here, valgrind is
totally cool with us writing out an only-partially-initialized block
to a disk file, but it's got a real problem with us reading that data
back into the same memory space it already occupies. That's a little
odd. I presume that it's common for the tail of the final block
written to be uninitialized, but normally when we then go read block
0, that's some other, fully initialized block.

It seems like it would be pretty easy to just suppress the useless
read when we've already got the correct data, and I'd lean toward
going that direction since it's a valid optimization anyway. But I'd
like to hear some opinions from people who use and think about
valgrind more than I do (Tom, Andres, Noah, ...?).

> It might seem like my suppression is overly broad, or not broad
> enough, since it essentially targets LogicalTapeFreeze(). I don't
> think it is, though, because this can occur in two places within
> LogicalTapeFreeze() -- it can occur in the place we actually saw the
> issue on lousyjack, from the ltsReadBlock() call within
> LogicalTapeFreeze(), as well as a second place -- when
> BufFileExportShared() is called. I found that you have to tweak code
> to prevent it happening in the first place before you'll see it happen
> in the second place.

I don't quite see how that would happen, because BufFileExportShared,
at least AFAICS, doesn't touch the buffer?

Unfortunately valgrind does not work at all on my laptop -- the server
appears to start, but as soon as you try to connect, the whole thing
dies with an error claiming that the startup process has failed. So I
can't easily test this at the moment. I'll try to get it working,
here or elsewhere, but thought I'd send the above reply first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 5, 2018 at 9:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> # LogicalTapeFreeze() may write out its first block when it is dirty but
> # not full, and then immediately read the first block back in from its
> # BufFile as a BLCKSZ-width block. This can only occur in rare cases
> # where next to no tuples were written out, which is only possible with
> # parallel external tuplesorts.
>
> So, if I understand correctly what you're saying here, valgrind is
> totally cool with us writing out an only-partially-initialized block
> to a disk file, but it's got a real problem with us reading that data
> back into the same memory space it already occupies.

That's not quite it. Valgrind is cool with a BufFileWrite(), which
doesn't result in an actual write() because the buffile.c stdio-style
buffer (which isn't where the uninitialized bytes originate from)
isn't yet filled. The actual write() comes later, and that's the point
at which Valgrind complains. IOW, Valgrind is cool with copying around
uninitialized memory before we do anything with the underlying values
(e.g., write(), something that affects control flow).

> I presume that it's common for the tail of the final block
> written to be uninitialized, but normally when we then go read block
> 0, that's some other, fully initialized block.

It certainly is common. In the case of logtape.c, we almost always
write out some garbage bytes, even with serial sorts. The only
difference here is the *sense* in which they're garbage: they're
uninitialized bytes, which Valgrind cares about, rather than bytes
from previous writes that are left behind in the buffer, which
Valgrind does not care about.

>> It might seem like my suppression is overly broad, or not broad
>> enough, since it essentially targets LogicalTapeFreeze(). I don't
>> think it is, though, because this can occur in two places within
>> LogicalTapeFreeze() -- it can occur in the place we actually saw the
>> issue on lousyjack, from the ltsReadBlock() call within
>> LogicalTapeFreeze(), as well as a second place -- when
>> BufFileExportShared() is called. I found that you have to tweak code
>> to prevent it happening in the first place before you'll see it happen
>> in the second place.
>
> I don't quite see how that would happen, because BufFileExportShared,
> at least AFAICS, doesn't touch the buffer?

It doesn't have to -- at least not directly. Valgrind remembers that
the uninitialized memory from logtape.c buffers is poisoned -- it
"spreads". The knowledge that the bytes are poisoned is tracked as
they're copied around. You get the error on the write() from the
BufFile buffer, despite the fact that you can make the error go away
by using palloc0() instead of palloc() within logtape.c, and nowhere
else.

--
Peter Geoghegan
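The rule Peter describes -- copying undefined bytes is fine, acting on
them is not -- can be seen in a tiny standalone C sketch (not from the
patch); Memcheck stays quiet on the memcpy() but reports the branch and
the syscall:

    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char    a[8];       /* never initialized: all 8 bytes undefined */
        char    b[8];

        memcpy(b, a, sizeof(a));    /* no error: undefinedness just propagates */

        if (b[0] == 0)              /* error: branch depends on undefined value */
            return 1;

        write(1, b, sizeof(b));     /* error: undefined bytes passed to write() */
        return 0;
    }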
On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It certainly is common. In the case of logtape.c, we almost always
> write out some garbage bytes, even with serial sorts. The only
> difference here is the *sense* in which they're garbage: they're
> uninitialized bytes, which Valgrind cares about, rather than bytes
> from previous writes that are left behind in the buffer, which
> Valgrind does not care about.

/me face-palms.

So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
on the buffer. "We know what we're doing, trust us!"

In some ways, that seems better than inserting a suppression, because
it only affects the memory in the buffer.

Anybody else want to express an opinion here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, February 5, 2018 4:27 pm, Robert Haas wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than bytes
>> from previous writes that are left behind in the buffer, which
>> Valgrind does not care about.
>
> /me face-palms.
>
> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer. "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.
>
> Anybody else want to express an opinion here?

Are the uninitialized bytes that are written out "whatever was in the
memory previously" or just "0x00 bytes from the allocation, not yet
overwritten by the PG code"?

Because the first sounds like it could be a security problem -- if
random junk bytes go out to the disk, and stay there, information
could inadvertently leak to permanent storage.

Best regards,

Tels
On Mon, Feb 5, 2018 at 1:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than bytes
>> from previous writes that are left behind in the buffer, which
>> Valgrind does not care about.

I should clarify what I meant here -- it is very common when we have
to freeze a tape, like when we do a serial external randomAccess
tuplesort, or a parallel worker's tuplesort. It shouldn't happen
otherwise. Note that there is a general pattern of dumping out the
current buffer just as the next one is needed, in order to make sure
that the linked list pointer correctly points to the
next/soon-to-be-current block. Note also that the majority of routines
declared within logtape.c can only be used on frozen tapes. I am
pretty confident that I've scoped this correctly by targeting
LogicalTapeFreeze().

> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer. "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.

I think that that would also work, and would be simpler, but would
also be slightly inferior to using the proposed suppression. If there
is garbage in logtape.c buffers, we still generally don't want to do
anything important on the basis of those values. We make one exception
with the suppression, which is a pretty typical kind of exception to
make -- don't worry if we write() poisoned bytes, since those are
bound to be alignment related.

OTOH, as I've said, we are generally bound to write some kind of
logtape.c garbage, which will almost certainly not be of the
uninitialized memory variety. So, while I feel that the suppression is
better, the advantage is likely microscopic.

--
Peter Geoghegan
On Mon, Feb 5, 2018 at 1:39 PM, Tels <nospam-abuse@bloodgate.com> wrote:
> Are the uninitialized bytes that are written out "whatever was in the
> memory previously" or just "0x00 bytes from the allocation, not yet
> overwritten by the PG code"?
>
> Because the first sounds like it could be a security problem -- if
> random junk bytes go out to the disk, and stay there, information
> could inadvertently leak to permanent storage.

But you can say the same thing about *any* of the
write()-of-uninitialized-bytes Valgrind suppressions that already
exist. There are quite a few of those. That just isn't part of our
security model.

--
Peter Geoghegan
On Mon, Feb 5, 2018 at 1:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
>> on the buffer. "We know what we're doing, trust us!"
>>
>> In some ways, that seems better than inserting a suppression, because
>> it only affects the memory in the buffer.
>
> I think that that would also work, and would be simpler, but would
> also be slightly inferior to using the proposed suppression. If there
> is garbage in logtape.c buffers, we still generally don't want to do
> anything important on the basis of those values. We make one exception
> with the suppression, which is a pretty typical kind of exception to
> make -- don't worry if we write() poisoned bytes, since those are
> bound to be alignment related.
>
> OTOH, as I've said, we are generally bound to write some kind of
> logtape.c garbage, which will almost certainly not be of the
> uninitialized memory variety. So, while I feel that the suppression is
> better, the advantage is likely microscopic.

Attached patch does it to the tail of the buffer, as Tom suggested on
the -committers thread.

Note that there is one other place in logtape.c that can write a
partial block like this: LogicalTapeRewindForRead(). I haven't
bothered to do anything there, since it cannot possibly be affected by
this issue, for the same reason that serial sorts cannot be -- it's
code that is only used by a tuplesort that really needs to spill to
disk and merge multiple runs (or for tapes that have already been
frozen, which are expected to never reallocate logtape.c buffers).

--
Peter Geoghegan
Attachment
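The attached patch isn't reproduced in the archive, but a rough sketch
of the idea (variable names invented here, not the actual change) would
use the Memcheck client-request wrappers from
src/include/utils/memdebug.h to mark just the uninitialized tail of a
partially-filled block:

    #include "utils/memdebug.h"     /* no-op macros unless USE_VALGRIND is defined */

    /*
     * Sketch only: of the BLCKSZ-sized tape buffer, only 'nbytes' were
     * ever written.  Telling Memcheck that the tail is "defined" keeps
     * the later write() of the whole block from being reported, while
     * leaving every other use of uninitialized memory checked as usual.
     */
    VALGRIND_MAKE_MEM_DEFINED(buffer + nbytes, BLCKSZ - nbytes);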
Robert Haas <robertmhaas@gmail.com> writes:
> Unfortunately valgrind does not work at all on my laptop -- the server
> appears to start, but as soon as you try to connect, the whole thing
> dies with an error claiming that the startup process has failed. So I
> can't easily test this at the moment. I'll try to get it working,
> here or elsewhere, but thought I'd send the above reply first.

Do you want somebody who does have a working valgrind installation
(ie me) to take responsibility for pushing this patch?

			regards, tom lane
On Tue, Feb 6, 2018 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Unfortunately valgrind does not work at all on my laptop -- the server
>> appears to start, but as soon as you try to connect, the whole thing
>> dies with an error claiming that the startup process has failed. So I
>> can't easily test this at the moment. I'll try to get it working,
>> here or elsewhere, but thought I'd send the above reply first.
>
> Do you want somebody who does have a working valgrind installation
> (ie me) to take responsibility for pushing this patch?

I committed it before seeing this. It probably would've been better
if you had done it, but I assume Peter tested it, so let's see what
the BF thinks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 6, 2018 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Do you want somebody who does have a working valgrind installation
>> (ie me) to take responsibility for pushing this patch?
>
> I committed it before seeing this. It probably would've been better
> if you had done it, but I assume Peter tested it, so let's see what
> the BF thinks.

I did test it with a full "make installcheck" + valgrind-3.11.0. I'd
be very surprised if this doesn't make the buildfarm go green.

--
Peter Geoghegan
On 02/06/2018 09:56 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Do you want somebody who does have a working valgrind installation
>>> (ie me) to take responsibility for pushing this patch?
>>
>> I committed it before seeing this. It probably would've been better
>> if you had done it, but I assume Peter tested it, so let's see what
>> the BF thinks.
>
> I did test it with a full "make installcheck" + valgrind-3.11.0. I'd
> be very surprised if this doesn't make the buildfarm go green.

Did you do a test with "-O0"? In my experience that makes valgrind
tests much more reliable and repeatable. Some time ago we've seen
cases that were failing for me but not for others, and I suspect it
was due to me using "-O0".

(This is more a random comment than a suggestion that your patch won't
make the buildfarm green.)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 6, 2018 at 1:04 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Did you do a test with "-O0"? In my experience that makes valgrind tests
> much more reliable and repeatable. Some time ago we've seen cases that
> were failing for me but not for others, and I suspect it was due to me
> using "-O0".

FWIW, I use -O1 when configure is run for Valgrind. I also turn off
assertions (this is all scripted). According to the Valgrind manual:

"With -O1 line numbers in error messages can be inaccurate, although
generally speaking running Memcheck on code compiled at -O1 works
fairly well, and the speed improvement compared to running -O0 is
quite significant. Use of -O2 and above is not recommended as Memcheck
occasionally reports uninitialised-value errors which don’t really
exist."

The manual does also say that there might even be some problems with
-O1 at a later point, but it sounds like it's probably worth it to me.
Skink uses -Og, FWIW.

--
Peter Geoghegan
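For anyone wanting to reproduce this kind of run, one plausible recipe
(a guess at the shape of such a script, not Peter's actual one) is:

    # Build with light optimization and with PostgreSQL's Valgrind
    # client requests compiled in (the code tests #ifdef USE_VALGRIND).
    ./configure --enable-debug CFLAGS="-O1" CPPFLAGS="-DUSE_VALGRIND"
    make -s install

    # Run the server under Memcheck, following forked children and
    # using the project's suppression file.
    valgrind --quiet --trace-children=yes --leak-check=no \
        --suppressions=src/tools/valgrind.supp \
        postgres -D "$PGDATA"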
On 02/06/2018 10:14 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 1:04 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Did you do a test with "-O0"? In my experience that makes valgrind tests
>> much more reliable and repeatable. Some time ago we've seen cases that
>> were failing for me but not for others, and I suspect it was due to me
>> using "-O0".
>
> FWIW, I use -O1 when configure is run for Valgrind. I also turn off
> assertions (this is all scripted). According to the Valgrind manual:
>
> "With -O1 line numbers in error messages can be inaccurate, although
> generally speaking running Memcheck on code compiled at -O1 works
> fairly well, and the speed improvement compared to running -O0 is
> quite significant. Use of -O2 and above is not recommended as Memcheck
> occasionally reports uninitialised-value errors which don’t really
> exist."

OK, although I was suggesting that the optimizations may actually have
the opposite effect -- valgrind missing some of the invalid memory
accesses (until the compiler decides not to use them for some reason,
causing sudden valgrind failures).

> The manual does also say that there might even be some problems with
> -O1 at a later point, but it sounds like it's probably worth it to me.
> Skink uses -Og, FWIW.

I have little idea what -Og exactly means. It seems to be focused on
debugging experience, and so still does some of the optimizations.
Which I think would explain why skink was not detecting some of the
failures for a long time.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 6, 2018 at 1:30 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I have little idea what -Og exactly means. It seems to be focused on
> debugging experience, and so still does some of the optimizations.

As I understand it, -Og allows any optimization that does not hamper
walking through code with a debugger.

> Which
> I think would explain why skink was not detecting some of the failures
> for a long time.

I think that skink didn't detect failures until now because the code
wasn't exercised until parallel CREATE INDEX was added, simply because
the function LogicalTapeFreeze() was never reached (though that's not
the only reason, it is the most obvious one).

--
Peter Geoghegan
On 02/06/2018 10:39 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 1:30 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I have little idea what -Og exactly means. It seems to be focused on
>> debugging experience, and so still does some of the optimizations.
>
> As I understand it, -Og allows any optimization that does not hamper
> walking through code with a debugger.
>
>> Which
>> I think would explain why skink was not detecting some of the failures
>> for a long time.
>
> I think that skink didn't detect failures until now because the code
> wasn't exercised until parallel CREATE INDEX was added, simply because
> the function LogicalTapeFreeze() was never reached (though that's not
> the only reason, it is the most obvious one).

Maybe. What I had in mind was a different thread from November,
discussing some non-deterministic valgrind failures:

https://www.postgresql.org/message-id/flat/20171125200014.qbewtip5oydqsklt%40alap3.anarazel.de#20171125200014.qbewtip5oydqsklt@alap3.anarazel.de

But you're right that may be irrelevant here. As I said, it was mostly
just a random comment about valgrind.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 6, 2018 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Unfortunately valgrind does not work at all on my laptop -- the server
>>> appears to start, but as soon as you try to connect, the whole thing
>>> dies with an error claiming that the startup process has failed. So I
>>> can't easily test this at the moment. I'll try to get it working,
>>> here or elsewhere, but thought I'd send the above reply first.
>>
>> Do you want somebody who does have a working valgrind installation
>> (ie me) to take responsibility for pushing this patch?
>
> I committed it before seeing this. It probably would've been better
> if you had done it, but I assume Peter tested it, so let's see what
> the BF thinks.

skink and lousyjack seem happy now, so I think it worked.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi all,
While testing this feature I found a crash on PG head with parallel CREATE INDEX using pgbench tables.
-- GUCs under postgresql.conf
max_parallel_maintenance_workers = 16
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 8GB
max_wal_size = 4GB
./pgbench -i -s 500 -d postgres
postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance,filler);
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Mar 7, 2018 at 8:13 AM, Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> Hi all,
>
> While testing this feature I found a crash on PG head with parallel
> CREATE INDEX using pgbench tables.
>
> -- GUCs under postgresql.conf
> max_parallel_maintenance_workers = 16
> max_parallel_workers = 16
> max_parallel_workers_per_gather = 8
> maintenance_work_mem = 8GB
> max_wal_size = 4GB
>
> ./pgbench -i -s 500 -d postgres
>
> postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance,filler);
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> [...]
> The connection to the server was lost. Attempting reset: Failed.
That makes it look like perhaps one of the worker backends crashed. Did you get a message in the logfile that might indicate the nature of the crash? Something with PANIC or TRAP, perhaps?
On Wed, Mar 7, 2018 at 7:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 7, 2018 at 8:13 AM, Prabhat Sahu
> <prabhat.sahu@enterprisedb.com> wrote:
>> While testing this feature I found a crash on PG head with parallel
>> CREATE INDEX using pgbench tables. [...]
>
> That makes it look like perhaps one of the worker backends crashed. Did
> you get a message in the logfile that might indicate the nature of the
> crash? Something with PANIC or TRAP, perhaps?
I am not able to see any PANIC/TRAP in the log file.
Here are the contents.
[edb@localhost bin]$ cat logsnew
2018-03-07 19:21:20.922 IST [54400] LOG: listening on IPv6 address "::1", port 5432
2018-03-07 19:21:20.922 IST [54400] LOG: listening on IPv4 address "127.0.0.1", port 5432
2018-03-07 19:21:20.925 IST [54400] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2018-03-07 19:21:20.936 IST [54401] LOG: database system was shut down at 2018-03-07 19:21:20 IST
2018-03-07 19:21:20.939 IST [54400] LOG: database system is ready to accept connections
2018-03-07 19:24:44.263 IST [54400] LOG: background worker "parallel worker" (PID 54482) was terminated by signal 9: Killed
2018-03-07 19:24:44.286 IST [54400] LOG: terminating any other active server processes
2018-03-07 19:24:44.297 IST [54405] WARNING: terminating connection because of crash of another server process
2018-03-07 19:24:44.297 IST [54405] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-03-07 19:24:44.297 IST [54405] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-03-07 19:24:44.301 IST [54478] WARNING: terminating connection because of crash of another server process
2018-03-07 19:24:44.301 IST [54478] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-03-07 19:24:44.301 IST [54478] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-03-07 19:24:44.494 IST [54504] FATAL: the database system is in recovery mode
2018-03-07 19:24:44.496 IST [54400] LOG: all server processes terminated; reinitializing
2018-03-07 19:24:44.513 IST [54505] LOG: database system was interrupted; last known up at 2018-03-07 19:22:54 IST
2018-03-07 19:24:44.552 IST [54505] LOG: database system was not properly shut down; automatic recovery in progress
2018-03-07 19:24:44.554 IST [54505] LOG: redo starts at 0/AB401A38
2018-03-07 19:25:14.712 IST [54505] LOG: invalid record length at 1/818B8D80: wanted 24, got 0
2018-03-07 19:25:14.714 IST [54505] LOG: redo done at 1/818B8D48
2018-03-07 19:25:14.714 IST [54505] LOG: last completed transaction was at log time 2018-03-07 19:24:05.322402+05:30
2018-03-07 19:25:16.887 IST [54400] LOG: database system is ready to accept connections
--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> 2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel
> worker" (PID 54482) was terminated by signal 9: Killed
That looks like the background worker got killed by the OOM killer. How much memory do you have in the machine where this occurred?
On 03/07/2018 03:21 PM, Robert Haas wrote:
> On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu
> <prabhat.sahu@enterprisedb.com> wrote:
>
>     2018-03-07 19:24:44.263 IST [54400] LOG:  background worker
>     "parallel worker" (PID 54482) was terminated by signal 9: Killed
>
> That looks like the background worker got killed by the OOM killer.
> How much memory do you have in the machine where this occurred?

FWIW that's usually written to the system log. Does dmesg say
something about the kill?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
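For reference, one way to check for an OOM kill on Linux (the exact
message wording varies across kernel versions):

    dmesg | grep -iE 'out of memory|oom-killer|killed process'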
On Wed, Mar 7, 2018 at 5:16 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> FWIW that's usually written to the system log. Does dmesg say something
> about the kill?

While it would be nice to confirm that it was indeed the OOM killer,
either way the crash happened because SIGKILL was sent to a parallel
worker. There is no reason to suspect a bug.

--
Peter Geoghegan
On March 7, 2018 5:40:18 PM PST, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Mar 7, 2018 at 5:16 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> FWIW that's usually written to the system log. Does dmesg say something
>> about the kill?
>
> While it would be nice to confirm that it was indeed the OOM killer,
> either way the crash happened because SIGKILL was sent to a parallel
> worker. There is no reason to suspect a bug.

Not impossible there's a leak somewhere though.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu
> <prabhat.sahu@enterprisedb.com> wrote:
>> 2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel
>> worker" (PID 54482) was terminated by signal 9: Killed
>
> That looks like the background worker got killed by the OOM killer. How
> much memory do you have in the machine where this occurred?
I have run the testcase on my local machine with the below configuration:
Environment: CentOS 7 (64-bit)
HD: 100GB
RAM: 4GB
Processors: 4
I have narrowed down the testcase as below, which also reproduces the same crash.
-- GUCs under postgresql.conf
maintenance_work_mem = 8GB
./pgbench -i -s 500 -d postgres
postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid, abalance,filler);
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation
The Postgres Database Company
Prabhat Sahu <prabhat.sahu@enterprisedb.com> writes:
> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That looks like the background worker got killed by the OOM killer. How
>> much memory do you have in the machine where this occurred?

> I have run the testcase on my local machine with the below configuration:
> Environment: CentOS 7 (64-bit)
> HD: 100GB
> RAM: 4GB
> Processors: 4

If you only have 4GB of physical RAM, it hardly seems surprising that
trying to use 8GB of maintenance_work_mem would draw the wrath of the
OOM killer.

			regards, tom lane
On Thu, Mar 8, 2018 at 11:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Prabhat Sahu <prabhat.sahu@enterprisedb.com> writes:
>> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> That looks like the background worker got killed by the OOM killer. How
>>> much memory do you have in the machine where this occurred?
>
>> I have run the testcase on my local machine with the below configuration:
>> Environment: CentOS 7 (64-bit)
>> HD: 100GB
>> RAM: 4GB
>> Processors: 4
>
> If you only have 4GB of physical RAM, it hardly seems surprising that
> trying to use 8GB of maintenance_work_mem would draw the wrath of the
> OOM killer.

Yup.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
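For anyone re-running Prabhat's test on a similarly small machine, the
obvious fix is to keep the sort memory budget well under physical RAM.
Something like the following should avoid the OOM kill (the exact value
is a judgment call; note that the leader and workers of a parallel
CREATE INDEX share the maintenance_work_mem budget):

    SET maintenance_work_mem = '512MB';
    CREATE INDEX pgb_acc_idx3 ON pgbench_accounts (aid, abalance, filler);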