Re: accounting for memory used for BufFile during hash joins - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: accounting for memory used for BufFile during hash joins
Date:
Msg-id: 20190508150844.rij36rtuk4lhvztw@development
In response to: Re: accounting for memory used for BufFile during hash joins (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: accounting for memory used for BufFile during hash joins
List: pgsql-hackers
On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
> On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
>
> > Nope, that's not how it works. It's the array of batches that gets
> > sliced, not the batches themselves.
> >
> > It does slightly increase the amount of data we need to shuffle between
> > the temp files, because we can't write the data directly to batches in
> > "future" slices. But that amplification is capped to ~2.2x (compared to
> > the ~1.4x in master) - I've shared some measurements in [1].
> >
> > [1]
> > https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>
> Cool, I misunderstood. I looked at the code again today, and at the email
> thread where you measured "amplification".
>

Oh! I hope you're not too disgusted by the code in that PoC patch ;-)

> In terms of how many times you write each tuple, is it accurate to
> say that a tuple can now be spilled three times (in the worst case)
> whereas, before, it could be spilled only twice?
>
> 1 - when building the inner side hashtable, tuple is spilled to a "slice"
> file
> 2 - (assuming the number of batches was increased) during execution, when
> a tuple belonging to a later slice's spill file is found, it is re-spilled
> to that slice's spill file
> 3 - during execution, when reading from its slice file, it is re-spilled
> (again) to its batch's spill file
>

Yes, that's a mostly accurate understanding. Essentially this might add one
extra step of "reshuffling" from the per-slice to per-batch files.

> Is it correct that the max number of BufFile structs you will have
> is equal to the number of slices + number of batches in a slice
> because that is the max number of open BufFiles you would have at a
> time?

Yes. With the caveat that we need twice that number of BufFile structs,
because we need them on both sides of the join.
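The accounting described above can be sketched numerically. This is a
hypothetical helper (the name and exact formula are a paraphrase of the
discussion, not code from the patch): with the batch array split into
slices of nbatch_inmemory batches each, at most one slice's batch files
plus one overflow file per slice are open at a time, and that count is
doubled because both the inner and outer side of the join need files.

```c
#include <assert.h>

/*
 * Hypothetical sketch, paraphrasing the discussion above (not from the
 * patch): slices + batches-per-slice open BufFiles at a time, counted
 * twice for the inner and outer side of the join.
 */
static int
max_buffile_structs(int nbatch, int nbatch_inmemory)
{
    /* number of slices needed to cover all batches, rounded up */
    int nslices = (nbatch + nbatch_inmemory - 1) / nbatch_inmemory;

    return 2 * (nslices + nbatch_inmemory);
}
```

For example, 1024 batches in slices of 64 would mean 16 slices and at most
2 * (16 + 64) = 160 BufFile structs, instead of 2 * 1024 with one file per
batch.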
> By the way, applying the v4 patch on master, in an assert build, I am
> tripping some asserts -- starting with
>
>     Assert(!file->readOnly);
>
> in BufFileWrite

Whoooops :-/

> One thing I was a little confused by was the nbatch_inmemory member
> of the hashtable. The comment in ExecChooseHashTableSize says that
> it is determining the number of batches we can fit in memory. I
> thought that the problem was the amount of space taken up by the
> BufFile data structure itself--which is related to the number of
> open BufFiles you need at a time. This comment in
> ExecChooseHashTableSize makes it sound like you are talking about
> fitting more than one batch of tuples into memory at a time. I was
> under the impression that you could only fit one batch of tuples in
> memory at a time.

I suppose you mean this chunk:

    /*
     * See how many batches we can fit into memory (driven mostly by size
     * of BufFile, with PGAlignedBlock being the largest part of that).
     * We need one BufFile for inner and outer side, so we count it twice
     * for each batch, and we stop once we exceed (work_mem/2).
     */
    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
compute the largest "slice" of batches for which we can keep the BufFile
structs in memory, without exceeding work_mem/2. Maybe nbatch_inmemory
should be renamed to nbatch_slice, not sure.

> So, I was stepping through the code with work_mem set to the lower
> bound, and in ExecHashIncreaseNumBatches, I got confused.
> hashtable->nbatch_inmemory was 2 for me, thus nbatch_tmp was 2, so
> I didn't meet this condition
>
>     if (nbatch_tmp > hashtable->nbatch_inmemory)
>
> since I just set nbatch_tmp using hashtable->nbatch_inmemory. So, I
> didn't increase the number of slices, which is what I was expecting.
> What happens when hashtable->nbatch_inmemory is equal to nbatch_tmp?
>

Ah, good catch.
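The quoted loop can be run standalone to see the slice sizes it produces.
This is a sketch under the assumption that sizeof(PGAlignedBlock) equals
BLCKSZ = 8192 bytes, as in a default build:

```c
#include <assert.h>

#define BLCKSZ 8192L  /* stand-in for sizeof(PGAlignedBlock), default build */

/*
 * Standalone version of the loop quoted above: keep doubling the slice
 * size while the BufFile buffers for one slice -- counted twice, for the
 * inner and outer side -- still fit into half of work_mem (given here
 * in kB, as the GUC is).
 */
static long
compute_nbatch_inmemory(long work_mem_kb)
{
    long nbatch_inmemory = 1;

    while ((nbatch_inmemory * 2) * BLCKSZ * 2 <= (work_mem_kb * 1024L / 2))
        nbatch_inmemory *= 2;

    return nbatch_inmemory;
}
```

With work_mem at its 64kB lower bound this returns 2, matching the
nbatch_inmemory = 2 observed in the walkthrough above; at the 4MB default
it returns 128.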
The condition you're referring to

    if (nbatch_tmp > hashtable->nbatch_inmemory)

should actually be

    if (nbatch > hashtable->nbatch_inmemory)

because the point is to initialize BufFile structs for the overflow files,
and we need to do that once we cross nbatch_inmemory.

And it turns out this actually causes the assert failures in the regression
tests you reported earlier. It failed to initialize the overflow files in
some cases, so the readOnly flag seemed to be set.

Attached is an updated patch, fixing this. I tried to clarify some of the
comments too, and I fixed another bug I found while running the regression
tests. It's still very much crappy PoC code, though.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services