Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets - Mailing list pgsql-hackers
From: Lawrence, Ramon
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date:
Msg-id: 6EEA43D22289484890D119821101B1DF2C182B@exchange20.mercury.ad.ubc.ca
In response to: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets ("Robert Haas" <robertmhaas@gmail.com>)
Responses: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
List: pgsql-hackers
> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
>
> I looked at this some more.  I'm a little concerned about the way
> we're maintaining the in-memory hash table.  Since the highest legal
> statistics target is now 10,000, it's possible that we could have two
> orders of magnitude more MCVs than what you're expecting.  As I read
> the code, that could lead to construction of an in-memory hash table
> with 64K slots.  On a 32-bit machine, I believe that works out to 16
> bytes per partition (12 and 4), which is a 1MB hash table.  That's
> not necessarily problematic, except that I don't think you're
> considering the size of the hash table itself when evaluating whether
> you are blowing out work_mem, and the default size of work_mem is
> 1MB.

I totally agree that 10,000 MCVs changes things.  Ideally, these
10,000 MCVs should be kept in memory because they will join with the
most tuples.  However, as you point out, the size of the MCV hash
table can be bigger than work_mem *by itself*, not even considering
the tuples in the table or in the in-memory batch.  Supporting that
many MCVs would require more modifications to the hash join algorithm.

100 MCVs should be able to fit in memory, though.  Since the number
of batches is rounded to a power of 2, there is often some portion of
hash_table_bytes that is not used by the in-memory batch and can be
"used" to store the MCV table.  The absolute size of the memory used
should also be reasonable (depending on the tuple size in bytes).

So, basically, we have a decision to make: try to support a larger
number of MCVs, or cap it at a reasonable number like 100.  You can
construct situations where using all 10,000 MCVs is good (for
instance, if every MCV has frequency 1/10,000), but I expect 100 MCVs
will capture the majority of cases, as the top 100 MCVs are usually
significantly more frequent than the later ones.

I now also see that the code should be changed to track the MCV bytes
separately from hashtable->spaceUsed, since spaceUsed is what
determines when to dynamically increase the number of batches.

> I also don't really understand why we're trying to control the size
> of the hash table by flushing tuples after the fact.  Right now, when
> the in-memory table fills up, we just keep adding tuples to it, which
> in turn forces us to flush out other tuples to keep the size down.
> This seems quite inefficient - not only are we doing a lot of
> unnecessary allocating and freeing, but those flushed slots in the
> hash table degrade performance (because they don't stop the scan for
> an empty slot).  It seems like we could simplify things considerably
> by adding tuples to the in-memory hash table only to the point where
> the next tuple would blow it out.  Once we get to that point, we can
> skip the isAMostCommonValue() test and send any future tuples
> straight to temp files.  (This would also reduce the memory
> consumption of the in-memory table by a factor of two.)

In the ideal case, we select a number of MCVs to support that we know
will always fit in memory.  The flushing is there to deal with the
case where we are doing a many-to-many join and there may be multiple
build-relation tuples with the same MCV value.

The issue with building the MCV table is that the hash operator will
not receive tuples in MCV-frequency order.  It is possible that the
MCV table fills up with tuples of less frequent MCVs before a tuple of
a more frequent MCV arrives.  In that case, we would like to keep the
more frequent MCV and bump one of the less frequent MCVs.
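To make that bumping behaviour concrete, here is a rough, untested
sketch of the idea.  This is not the patch code and it is deliberately
simplified: it uses a flat array and the planner's frequency estimates
directly, whereas the real MCV table is a hash table on the join key;
all of the names (MCVSlot, MCVTable, mcv_table_insert, flush_to_batch)
are made up for illustration.

#include <stdio.h>

/* Hypothetical in-memory MCV slot: a build-side tuple plus the
 * planner's frequency estimate for its join key value. */
typedef struct MCVSlot
{
    int     tuple_id;    /* stand-in for the actual build tuple */
    double  frequency;   /* estimated MCV frequency of its key */
} MCVSlot;

typedef struct MCVTable
{
    MCVSlot slots[4];    /* capacity chosen to respect work_mem */
    int     nslots;      /* slots currently occupied */
    int     maxslots;
} MCVTable;

/* Stand-in for writing a tuple to its batch temp file. */
static void
flush_to_batch(int tuple_id)
{
    printf("tuple %d flushed to batch file\n", tuple_id);
}

/*
 * Keep 'tuple_id' (whose key has estimated frequency 'freq') in the
 * in-memory MCV table if possible.  When the table is full, evict the
 * least frequent resident tuple if the newcomer is more frequent;
 * otherwise the newcomer goes straight to its batch file.
 */
static void
mcv_table_insert(MCVTable *t, int tuple_id, double freq)
{
    int     i,
            min_idx;

    if (t->nslots < t->maxslots)
    {
        t->slots[t->nslots].tuple_id = tuple_id;
        t->slots[t->nslots].frequency = freq;
        t->nslots++;
        return;
    }

    /* Table is full: locate the least frequent resident entry. */
    min_idx = 0;
    for (i = 1; i < t->nslots; i++)
        if (t->slots[i].frequency < t->slots[min_idx].frequency)
            min_idx = i;

    if (freq > t->slots[min_idx].frequency)
    {
        /* Bump the less frequent MCV to disk and keep the newcomer. */
        flush_to_batch(t->slots[min_idx].tuple_id);
        t->slots[min_idx].tuple_id = tuple_id;
        t->slots[min_idx].frequency = freq;
    }
    else
        flush_to_batch(tuple_id);
}

int
main(void)
{
    MCVTable    t = {{{0}}, 0, 4};

    /* Tuples arrive in arbitrary order, not MCV-frequency order. */
    mcv_table_insert(&t, 1, 0.01);
    mcv_table_insert(&t, 2, 0.02);
    mcv_table_insert(&t, 3, 0.03);
    mcv_table_insert(&t, 4, 0.04);
    mcv_table_insert(&t, 5, 0.10);   /* bumps tuple 1 (freq 0.01) */
    mcv_table_insert(&t, 6, 0.005);  /* less frequent: goes to batch */
    return 0;
}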
> We could potentially improve on this even further if we can estimate
> in advance how many MCVs we can fit into the in-memory hash table
> before it gets blown out.  If, for example, we have only 1MB of
> work_mem but there are 10,000 MCVs, getMostCommonValues() might
> decide to only hash the first 1,000 MCVs.  Even if we still blow out
> the in-memory hash table, the earlier MCVs are more frequent than the
> later MCVs, so the ones that actually make it into the table are
> likely to be more beneficial.  I'm not sure exactly how to do this
> tuning though, since we'd need to approximate the size of the
> tuples...  I guess the query planner makes some effort to estimate
> that but I'm not sure how to get at it.

The number of batches (nbatch), inner_rel_bytes, and hash_table_bytes
are calculated in ExecChooseHashTableSize() in nodeHash.c.  The number
of bytes "free", i.e. not allocated to the in-memory batch, is then:

  hash_table_bytes - inner_rel_bytes / nbatch

Depending on the power-of-2 rounding of nbatch, this may be almost 0
or quite large.  You could change the calculation of nbatch or try to
resize the in-memory batch, but that opens up a can of worms.  It may
be best to assume a small number of MCVs, say 10 or 100.

> > However, the join with Part and LineItem *should* show a benefit
> > but may not because of a limitation of the patch implementation
> > (not the idea).  The MCV optimization is currently only enabled
> > when the probe side is a sequential scan.  This limitation is due
> > to our current inability to determine a stats tuple of the join
> > attribute on the probe side for other operators.  (This should be
> > possible - help please?)
>
> Not sure how to get at this either, but I'll take a look and see if I
> can figure it out.

After more digging, we can extract the original relation id and
attribute id of the join attribute using the instance variables
varnoold and varoattno of Var.  It is documented that these variables
are just kept around for debugging, but they are definitely useful
here.  The new code would be:

  /* recover the base relation and column behind the probe-side Var */
  relid = getrelid(variable->varnoold, estate->es_range_table);
  relattnum = variable->varoattno;

Thanks for working with us on the patch.

Happy Holidays Everyone,

Ramon Lawrence
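P.S.  For illustration only, a rough, untested sketch of how the
recovered relid and attribute number might then be used to fetch the
probe-side statistics tuple.  It assumes the plain
SearchSysCache()/STATRELATT syscache interface; the wrapper name
lookup_probe_side_stats and the exact set of includes are made up for
the example and are not part of the patch.

#include "postgres.h"
#include "access/attnum.h"        /* AttrNumber */
#include "access/htup.h"          /* HeapTuple, HeapTupleIsValid */
#include "nodes/execnodes.h"      /* EState */
#include "nodes/primnodes.h"      /* Var */
#include "parser/parsetree.h"     /* getrelid() */
#include "utils/syscache.h"       /* SearchSysCache, STATRELATT */

/*
 * Illustrative helper: given the probe-side join key Var, recover the
 * underlying relation OID and attribute number through varnoold and
 * varoattno, then look up the attribute's pg_statistic tuple.
 */
static HeapTuple
lookup_probe_side_stats(Var *variable, EState *estate)
{
    Oid         relid;
    AttrNumber  relattnum;
    HeapTuple   statsTuple;

    relid = getrelid(variable->varnoold, estate->es_range_table);
    relattnum = variable->varoattno;

    statsTuple = SearchSysCache(STATRELATT,
                                ObjectIdGetDatum(relid),
                                Int16GetDatum(relattnum),
                                0, 0);

    if (!HeapTupleIsValid(statsTuple))
        return NULL;            /* no statistics for this column */

    return statsTuple;          /* caller must ReleaseSysCache() it */
}

The MCV values and frequencies would then be extracted from the
returned tuple, after which the tuple is released with
ReleaseSysCache().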