Re: 9.5: Memory-bounded HashAgg - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: 9.5: Memory-bounded HashAgg |
Date | |
Msg-id | 5407A592.3060209@fuzzy.cz |
In response to | Re: 9.5: Memory-bounded HashAgg (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: 9.5: Memory-bounded HashAgg, Re: 9.5: Memory-bounded HashAgg |
List | pgsql-hackers |

On 20.8.2014 20:32, Robert Haas wrote:
> On Sun, Aug 17, 2014 at 1:17 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Being able to batch inner and outer relations in a matching way is
>> certainly one of the reasons why hashjoin uses that particular scheme.
>> There are other reasons, though - for example being able to answer
>> 'Does this group belong to this batch?' quickly, and automatically
>> update number of batches.
>>
>> I'm not saying the lookup is extremely costly, but I'd be very
>> surprised if it was as cheap as modulo on a 32-bit integer. Not saying
>> it's the dominant cost here, but memory bandwidth is quickly becoming
>> one of the main bottlenecks.
>
> Well, I think you're certainly right that a hash table lookup is more
> expensive than modulo on a 32-bit integer; so much is obvious. But if
> the load factor is not too large, I think that it's not a *lot* more
> expensive, so it could be worth it if it gives us other advantages.

Yes, that may be true. I'm not opposed to Jeff's approach in general -
it's certainly a nice solution for cases with a fixed-size aggregate
state. But I still don't see how it could handle aggregates with a
growing aggregate state (which is the case that troubles me, because
that's what we see in our workloads).
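Just to illustrate what I mean by 'modulo on a 32-bit integer': in the
hashjoin scheme, 'does this group belong to this batch?' is answered by a
mask and a shift on the hash value, in the spirit of
ExecHashGetBucketAndBatch() in nodeHash.c. A simplified standalone sketch
(not the actual source, names shortened), assuming nbuckets and nbatch are
powers of two:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Simplified sketch of hashjoin-style bucket/batch assignment.
     * The bucket comes from the low bits of the hash and the batch
     * from the next bits, so the check is just a mask and a shift.
     */
    static void
    get_bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets,
                         int log2_nbuckets, uint32_t nbatch,
                         int *bucketno, int *batchno)
    {
        *bucketno = (int) (hashvalue & (nbuckets - 1));
        if (nbatch > 1)
            *batchno = (int) ((hashvalue >> log2_nbuckets) & (nbatch - 1));
        else
            *batchno = 0;
    }

    int
    main(void)
    {
        int         bucket;
        int         batch;

        /* the same hash value, before and after doubling nbatch */
        get_bucket_and_batch(0xDEADBEEF, 1024, 10, 4, &bucket, &batch);
        printf("nbatch=4: bucket=%d batch=%d\n", bucket, batch);
        get_bucket_and_batch(0xDEADBEEF, 1024, 10, 8, &bucket, &batch);
        printf("nbatch=8: bucket=%d batch=%d\n", bucket, batch);
        return 0;
    }

That's also why the number of batches can only grow by doubling - each
doubling simply consumes one more bit of the hash, so a group either stays
in its batch or moves to exactly one new batch.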
> As I see it, the advantage of Jeff's approach is that it doesn't
> really matter whether our estimates are accurate or not. We don't
> have to decide at the beginning how many batches to do, and then
> possibly end up using too much or too little memory per batch if we're
> wrong; we can let the amount of memory actually used during execution
> determine the number of batches. That seems good. Of course, a hash

Yes. I think that maybe we could use Jeff's approach even for the
'growing aggregate state' case, assuming we can serialize the aggregate
states and release the memory properly.

First, the problem with the current hash table used in HashAggregate
(i.e. dynahash) is that it never actually frees memory - when you do
HASH_REMOVE it only moves the entry to a list of entries for future use.

Imagine a workload where you initially see only 1 tuple for each group
before work_mem gets full. At that point you stop adding new groups, but
the existing ones keep growing. Even if you knew how to serialize the
aggregate states (which we don't), you'd be in trouble, because the
initial state is small (only 1 tuple was passed to the group) and most of
the memory is stuck in dynahash.

> join can increase the number of batches on the fly, but only by
> doubling it, so you might go from 4 batches to 8 when 5 would really
> have been enough. And a hash join also can't *reduce* the number of
> batches on the fly, which might matter a lot. Getting the number of
> batches right avoids I/O, which is a lot more expensive than CPU.

Regarding the estimates, I don't see much difference between the two
approaches when handling this issue.

It's true you can wait with deciding how many partitions (aka batches) to
create until work_mem is full, at which point you have more information
than at the very beginning. You know how many tuples you've already seen
and how many tuples you expect (which is however only an estimate, etc.),
and you may use that to estimate the number of partitions to create.

That however comes at a cost - it's not really a memory-bounded hash
aggregate, because you explicitly allow exceeding work_mem as more tuples
for existing groups arrive.

Also, no one really says the initial estimate of how many tuples will be
aggregated is correct. It's about as unreliable as the group count
estimate, so how exactly are you going to estimate the number of
partitions?

Considering this, I doubt that being able to choose an arbitrary number
of partitions (instead of only powers of 2) is really an advantage.
Reducing the number of partitions might matter, but in my experience most
estimation errors are underestimations, because we assume independence
where in practice the columns are dependent, etc.

I agree that getting the batches right is important, but OTOH with a hash
join, using more smaller batches is often significantly faster than using
one large one. So it depends.

What I think we should prevent is under-estimating the number of batches,
because in that case you have to read the whole batch, write part of it
out again and then read it again, instead of just writing it once (into
two files). Reading a tuple from a batch only to write it to another
batch is not really efficient.

>>> But the situation here isn't comparable, because there's only one
>>> input stream. I'm pretty sure we'll want to keep track of which
>>> transition states we've spilled due to lack of memory as opposed to
>>> those which were never present in the table at all, so that we can
>>> segregate the unprocessed tuples that pertain to spilled transition
>>> states from the ones that pertain to a group we haven't begun yet.
>>
>> Why would that be necessary or useful? I don't see a reason for
>> tracking that / segregating the tuples.
>
> Suppose there are going to be three groups: A, B, C. Each is an
> array_agg(), and they're big, so only one of them will fit in work_mem
> at a time. However, we don't know that at the beginning, either because
> we don't write the code to try or because we do write that code but
> our cardinality estimates are way off; instead, we're under the
> impression that all three will fit in work_mem. So we start reading
> tuples. We see values for A and B, but we don't see any values for C
> because those all occur later in the input. Eventually, we run short
> of memory and cut off creation of new groups. Any tuples for C are
> now going to get written to a tape from which we'll later reread them.
> After a while, even that proves insufficient and we spill the
> transition state for B to disk. Any further tuples that show up for C
> will need to be written to tape as well. We continue processing and
> finish group A.
>
> Now it's time to do batch #2. Presumably, we begin by reloading the
> serialized transition state for group B. To finish group B, we must
> look at all the tuples that might possibly fall in that group. If all
> of the remaining tuples are on a single tape, we'll have to read all
> the tuples in group B *and* all the tuples in group C; we'll
> presumably rewrite the tuples that are not part of this batch onto a
> new tape, which we'll then process in batch #3. But if we took
> advantage of the first pass through the input to put the tuples for
> group B on one tape and the tuples for group C on another tape, we can
> be much more efficient - just read the remaining tuples for group B,
> not mixed with anything else, and then read a separate tape for group
> C.

OK, I understand the idea. However I don't think it makes much sense to
segregate every little group - that's a perfect fit for batching. What
might be worth segregating are the exceptionally large groups, because
that's what may make batching inefficient - for example, when a group is
larger than work_mem, it will result in a batch per group (even if the
remaining groups are tiny).

But we have no way to identify such a group, because we have no way to
determine the size of its state. What we might do is assume that the
size is proportional to the number of tuples, and segregate only the
largest groups. This can easily be done with hashjoin-like batching -
adding ntuples, isSegregated and skewBatchId to the AggHashEntry. The
segregated entry would then act as a placeholder (only the hash entry
would be stored in the batch, while the actual state etc. would be
stored separately). This is a bit similar to how hashjoin handles skew
buckets.
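Roughly what I have in mind - just a sketch, not actual source: the
existing entry contents are elided and the field names (ntuples,
isSegregated, skewBatchId) are the hypothetical additions mentioned
above, mirroring how nodeHash.c tracks skew buckets.

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Sketch of an AggHashEntry extended with enough information to
     * segregate exceptionally large groups (names hypothetical).
     */
    typedef struct AggHashEntrySketch
    {
        /* ... existing per-group data: grouping key, transition state ... */

        int64_t     ntuples;        /* tuples seen for this group so far */
        bool        isSegregated;   /* state kept out of regular batches? */
        int         skewBatchId;    /* file holding this group's tuples */
    } AggHashEntrySketch;

    /*
     * Tuple routing would then look roughly like this:
     *
     *     if (entry->isSegregated)
     *         write the tuple to the file identified by skewBatchId;
     *     else
     *         batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
     *
     * so only the placeholder hash entry lives in the regular batch,
     * while the (potentially huge) aggregate state is kept separately.
     */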
It's true that Jeff's approach handles this somewhat better, but at the
cost of not really bounding the memory consumed by HashAggregate.

Tomas