
From: Tomas Vondra
Subject: Re: 9.5: Memory-bounded HashAgg
Date:
Msg-id: 53ED487E.1060705@fuzzy.cz
In response to: Re: 9.5: Memory-bounded HashAgg  (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers
On 14.8.2014 21:47, Tomas Vondra wrote:
> On 14.8.2014 18:54, Jeff Davis wrote:
>> On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote:
>>> Either it belongs to the current batch (and either it's in the hash
>>> table, or you add it there), or it's not - in that case write it to a
>>> temp file.
>>
>> I think the part you left out is that you need two files per batch: one
>> for the dumped-out partially-computed state values, and one for the
>> tuples.
>>
>> In other words, you haven't really discussed the step where you reunite
>> the tuples with that partially-computed state.
> 
> No, that's not how the serialize/deserialize should work. The aggregate
> needs to store the state as-is, so that after deserializing it gets
> pretty much the same thing.
> 
> For example, for 'median' the state is the list of all the values
> received so far, and when serializing it you have to write all the
> values out. After deserializing it, you will get the same list of values.
> 
> Some aggregates may use complex data structures that may need a more
> elaborate serialization step.
> 
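To make that concrete, here is a minimal sketch (illustration only, not the
actual patch or any existing PostgreSQL aggregate API) of what a
serialize/deserialize pair might look like for an aggregate whose state is a
flat array of values, as in the median example. The MedianState struct and
the function names are made up for this illustration:

#include <stdlib.h>
#include <string.h>

typedef struct MedianState
{
    int     nvalues;    /* number of values accumulated so far */
    double *values;     /* the values themselves */
} MedianState;

/* dump the whole state into a flat buffer; caller frees the result */
char *
median_serialize(MedianState *state, size_t *len)
{
    char   *buf;

    *len = sizeof(int) + state->nvalues * sizeof(double);
    buf = malloc(*len);

    memcpy(buf, &state->nvalues, sizeof(int));
    memcpy(buf + sizeof(int), state->values,
           state->nvalues * sizeof(double));

    return buf;
}

/* rebuild an equivalent state from the flat buffer */
MedianState *
median_deserialize(const char *buf)
{
    MedianState *state = malloc(sizeof(MedianState));

    memcpy(&state->nvalues, buf, sizeof(int));
    state->values = malloc(state->nvalues * sizeof(double));
    memcpy(state->values, buf + sizeof(int),
           state->nvalues * sizeof(double));

    return state;
}

Whatever the internal representation, the contract is simply that
deserialize(serialize(state)) produces an equivalent state.
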
>>> For sure, it's not for free - it may write to quite a few files. Is it
>>> more expensive than what you propose? I'm not sure about that. With
>>> your batching scheme, you'll end up with a lower number of large batches,
>>> and you'll need to read and split them, possibly repeatedly. The
>>> batching scheme from hashjoin minimizes this.
>>
>> My approach only has fewer batches if it elects to have fewer batches,
>> which might happen for two reasons:
>>  1. A cardinality misestimate. This certainly could happen, but we do
>> have useful numbers to work from (we know the number of tuples and
>> distincts that we've read so far), so it's far from a blind guess. 
>>  2. We're concerned about the random I/O from way too many partitions.
> 
> OK. We can't really do much with the cardinality estimate.
> 
> As for the random IO concerns, I did a quick test to see how this
> behaves. I used an HP ProLiant DL380 G5 (i.e. a quite old machine, from
> 2006-09 if I'm not mistaken): 16GB RAM, RAID10 on 6 x 10k SAS drives,
> 512MB write cache. So quite a lousy machine by today's standards.
> 
> I used a simple C program (attached) that creates N files, and writes
> into them in a round-robin fashion until a particular file size is
> reached. I opted for 64GB total size, 1kB writes.
> 
>     ./iotest filecount filesize writesize
> 
> File size is in MB, writesize is in bytes. So for example this writes 64
> files, each 1GB, using 512B writes.
> 
>     ./iotest 64 1024 512
> 
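For reference, a minimal sketch of such a round-robin writer is below. This
is an illustration only, not necessarily identical to the attached iotest.c:
the timing of the before/after-fsync phases and most error handling are
omitted, and the "iotest.N" file names are just placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int         filecount,
                writesize,
                i;
    long long   filesize,
                written;
    int        *fds;
    char       *buf;

    if (argc != 4)
    {
        fprintf(stderr, "usage: %s filecount filesize writesize\n", argv[0]);
        return 1;
    }

    filecount = atoi(argv[1]);
    filesize = atoll(argv[2]) * 1024LL * 1024LL;    /* MB -> bytes */
    writesize = atoi(argv[3]);

    fds = malloc(filecount * sizeof(int));
    buf = malloc(writesize);
    memset(buf, 'x', writesize);

    /* create the files */
    for (i = 0; i < filecount; i++)
    {
        char        fname[64];

        snprintf(fname, sizeof(fname), "iotest.%d", i);
        fds[i] = open(fname, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    }

    /* round-robin writes until each file reaches the target size */
    for (written = 0; written < filesize; written += writesize)
        for (i = 0; i < filecount; i++)
            write(fds[i], buf, writesize);

    /* the "before fsync" time would be taken here */

    for (i = 0; i < filecount; i++)
    {
        fsync(fds[i]);
        close(fds[i]);
    }

    /* ... and the "after fsync" time here */

    free(fds);
    free(buf);

    return 0;
}
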
> Measured is duration before/after fsync (in seconds):
> 
>     files   |    file size  |  before  fsync |  after fsync
>    ---------------------------------------------------------
>     32      |      2048     |        290.16  |      294.33
>     64      |      1024     |        264.68  |      267.60
>     128     |       512     |        278.68  |      283.44
>     256     |       256     |        332.11  |      338.45
>     1024    |        64     |        419.91  |      425.48
>     2048    |        32     |        450.37  |      455.20
> 
> So while there is a difference, I don't think it's the 'random I/O wall'
> as usually observed on rotational drives. Also, this is a 2.6.32 kernel,
> and my suspicion is that with a newer one the behaviour would be better.
> 
> I also have an SSD in that machine (Intel S3700), so I did the same test
> with these results:
> 
>     files   |    file size  |  before  fsync |  after fsync
>    ---------------------------------------------------------
>     32      |      2048     |        445.05  |      464.73
>     64      |      1024     |        447.32  |      466.56
>     128     |       512     |        446.63  |      465.90
>     256     |       256     |        446.64  |      466.19
>     1024    |        64     |        511.85  |      523.24
>     2048    |        32     |        579.92  |      590.76
> 
> So yes, the number of files matters, but I don't think the effect is
> strong enough to draw a clear line on how many batches we allow,
> especially considering how old this machine is (on 3.x kernels, we
> usually see much better performance under I/O-intensive conditions).

And just for fun, I did the same test on a workstation with 8GB of RAM,
S3700 SSD, i5-2500 CPU and kernel 3.12. That is, more modern
hardware / kernel / ... compared to the machine above.

For a test writing 32GB of data (4x the RAM), I got these results:

    files   |    file size  |  before  fsync |  after fsync
   ---------------------------------------------------------
    32      |      1024     |        171.27  |      175.96
    64      |       512     |        165.57  |      170.12
    128     |       256     |        165.29  |      169.95
    256     |       128     |        164.69  |      169.62
    512     |        64     |        163.98  |      168.90
    1024    |        32     |        165.44  |      170.50
    2048    |        16     |        165.97  |      171.35
    4096    |         8     |        166.55  |      173.26

So, no sign of a slowdown at all in this case. I don't have a rotational
disk in the machine at this moment, so I can't repeat the test. But I
don't expect the impact to be much worse than for the old machine.

I'm not sure whether this proves we should not worry about the number of
batches at all - old kernels / machines will be with us for some time.
However, I'm not a fan of artificially limiting the implementation
because of decade-old machines either.

Tomas




