Thread: Vacuum: allow usage of more than 1GB of work mem

Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
The attached patch allows setting maintenance_work_mem or
autovacuum_work_mem higher than 1GB (and have it take effect), by turning the
allocation of the dead_tuples into a huge allocation.
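
For reference, a minimal sketch of the kind of change this describes, reusing
the variables lazy_space_alloc() already has (illustrative only, not the
literal patch; palloc_extended() with MCXT_ALLOC_HUGE is one way to request a
huge allocation):

    /*
     * Sketch: raise the clamp from MaxAllocSize (~1GB) to MaxAllocHugeSize
     * and request a "huge" allocation, so work mem above 1GB takes effect.
     */
    maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
    maxtuples = Min(maxtuples, INT_MAX);
    maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));

    vacrelstats->num_dead_tuples = 0;
    vacrelstats->max_dead_tuples = (int) maxtuples;
    vacrelstats->dead_tuples = (ItemPointer)
        palloc_extended(maxtuples * sizeof(ItemPointerData), MCXT_ALLOC_HUGE);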

This results in fewer index scans for heavily bloated tables, and
could be a lifesaver in many situations (in particular, the situation
I'm living right now in production, where we don't have enough room
for a vacuum full, and have just deleted 75% of a table to make room
but have to rely on regular lazy vacuum to free the space).

The patch also makes vacuum free the dead_tuples before starting
truncation. It didn't seem necessary to hold onto it beyond that
point, and it might help give the OS more cache, especially if work
mem is configured very high to avoid multiple index scans.

Tested with pgbench scale 4000 after deleting the whole
pgbench_accounts table, seemed to work fine.

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Simon Riggs
Date:
On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote:
> The attached patch allows setting maintainance_work_mem or
> autovacuum_work_mem higher than 1GB (and be effective), by turning the
> allocation of the dead_tuples into a huge allocation.
>
> This results in fewer index scans for heavily bloated tables, and
> could be a lifesaver in many situations (in particular, the situation
> I'm living right now in production, where we don't have enough room
> for a vacuum full, and have just deleted 75% of a table to make room
> but have to rely on regular lazy vacuum to free the space).

This part looks fine. I'm inclined to commit the attached patch soon.

> The patch also makes vacuum free the dead_tuples before starting
> truncation. It didn't seem necessary to hold onto it beyond that
> point, and it might help give the OS more cache, especially if work
> mem is configured very high to avoid multiple index scans.

How long does that part ever take? Is there any substantial gain from this?

Let's discuss that as a potential second patch.

> Tested with pgbench scale 4000 after deleting the whole
> pgbench_accounts table, seemed to work fine.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The patch also makes vacuum free the dead_tuples before starting
>> truncation. It didn't seem necessary to hold onto it beyond that
>> point, and it might help give the OS more cache, especially if work
>> mem is configured very high to avoid multiple index scans.
>
> How long does that part ever take? Is there any substantial gain from this?
>
> Lets discuss that as a potential second patch.

In the test case I mentioned, it takes longer than the vacuum part itself.

Other than freeing RAM there's no gain. I didn't measure any speed
difference while testing, but that's probably because the backward
scan doesn't benefit from the cache, but other activity on the system
might. So, depending on the workload on the server, extra available
RAM may be a significant gain on its own or not. It just didn't seem
there was a reason to keep that RAM reserved, especially after making
it a huge allocation.

I'm fine either way. I can remove that from the patch or leave it
as-is. It just seemed like a good idea at the time.



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Mon, Sep 5, 2016 at 11:50 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> The patch also makes vacuum free the dead_tuples before starting
>>> truncation. It didn't seem necessary to hold onto it beyond that
>>> point, and it might help give the OS more cache, especially if work
>>> mem is configured very high to avoid multiple index scans.
>>
>> How long does that part ever take? Is there any substantial gain from this?
>>
>> Lets discuss that as a potential second patch.
>
> In the test case I mentioned, it takes longer than the vacuum part itself.
>
> Other than freeing RAM there's no gain. I didn't measure any speed
> difference while testing, but that's probably because the backward
> scan doesn't benefit from the cache, but other activity on the system
> might. So, depending on the workload on the server, extra available
> RAM may be a significant gain on its own or not. It just didn't seem
> there was a reason to keep that RAM reserved, especially after making
> it a huge allocation.
>
> I'm fine either way. I can remove that from the patch or leave it
> as-is. It just seemed like a good idea at the time.


Rebased and split versions attached

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Simon Riggs
Date:
On 5 September 2016 at 15:50, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> The patch also makes vacuum free the dead_tuples before starting
>>> truncation. It didn't seem necessary to hold onto it beyond that
>>> point, and it might help give the OS more cache, especially if work
>>> mem is configured very high to avoid multiple index scans.
>>
>> How long does that part ever take? Is there any substantial gain from this?
>>
>> Lets discuss that as a potential second patch.
>
> In the test case I mentioned, it takes longer than the vacuum part itself.

Please provide a test case and timings so we can see what's happening.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 September 2016 at 15:50, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>> The patch also makes vacuum free the dead_tuples before starting
>>>> truncation. It didn't seem necessary to hold onto it beyond that
>>>> point, and it might help give the OS more cache, especially if work
>>>> mem is configured very high to avoid multiple index scans.
>>>
>>> How long does that part ever take? Is there any substantial gain from this?
>>>
>>> Lets discuss that as a potential second patch.
>>
>> In the test case I mentioned, it takes longer than the vacuum part itself.
>
> Please provide a test case and timings so we can see what's happening.

The referenced test case is the one I mentioned in the OP:

- createdb pgbench
- pgbench -i -s 4000 pgbench
- psql pgbench -c 'delete from pgbench_accounts;'
- vacuumdb -v -t pgbench_accounts pgbench

fsync=off, autovacuum=off, maintenance_work_mem=4GB

From what I remember, it used ~2.7GB of RAM up until the truncate
phase, where it freed it. It performed a single index scan over the
PK.

I don't remember timings, and I didn't take them, so I'll have to
repeat the test to get them. It takes all day and makes my laptop
unusably slow, so I'll post them later, but they're not very
interesting. The only interesting bit is that it does a single index
scan instead of several, which on TB-or-more tables is kinda nice.

Btw, without a further patch to prefetch pages on the backward scan
for truncate, my patience ran out before it finished truncating. I
haven't submitted that patch because there was an identical patch in
an older thread that was discussed and more or less rejected, since it
slightly penalized SSDs. While I'm confident my version of the patch
is a little easier on SSDs, I haven't got an SSD at hand to confirm,
so that patch is still waiting in the queue until I get access to an
SSD. The tests I'll run include that patch, so the timing regarding
truncate won't be representative of HEAD (I really can't afford to run
the tests on a scale=4000 pgbench without that patch; it crawls, and
smaller scales don't fill the dead_tuples array).
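
For reference, a rough sketch of the kind of prefetching meant here, modelled
on the backward scan in count_nondeletable_pages() (illustrative only, not the
submitted patch; the prefetch distance is an arbitrary example value):

    #define PREFETCH_DISTANCE 32    /* arbitrary example value */

    blkno = vacrelstats->rel_pages;
    while (blkno > vacrelstats->nonempty_pages)
    {
        blkno--;

        /*
         * Every PREFETCH_DISTANCE pages, ask the kernel to start reading
         * the next batch of (lower-numbered) blocks, so a rotating disk
         * isn't limited to one synchronous random read per page.
         */
        if ((blkno % PREFETCH_DISTANCE) == 0)
        {
            BlockNumber pblkno = (blkno >= PREFETCH_DISTANCE) ?
                blkno - PREFETCH_DISTANCE : 0;

            while (pblkno < blkno)
                PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno++);
        }

        buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno, RBM_NORMAL,
                                 vac_strategy);
        /* ... inspect the page as the existing code does ... */
        ReleaseBuffer(buf);
    }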



Re: Vacuum: allow usage of more than 1GB of work mem

From
Simon Riggs
Date:
On 5 September 2016 at 21:58, Claudio Freire <klaussfreire@gmail.com> wrote:

>>>> How long does that part ever take? Is there any substantial gain from this?

> Btw, without a further patch to prefetch pages on the backward scan
> for truncate, however, my patience ran out before it finished
> truncating. I haven't submitted that patch because there was an
> identical patch in an older thread that was discussed and more or less
> rejected since it slightly penalized SSDs.

OK, that's enough context. Sorry for being forgetful on that point.

Please post that new patch also.


This whole idea of backwards scanning to confirm truncation seems
wrong. What we want is an O(1) solution. Thinking.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Jim Nasby
Date:
On 9/4/16 1:46 AM, Simon Riggs wrote:
>> > The patch also makes vacuum free the dead_tuples before starting
>> > truncation. It didn't seem necessary to hold onto it beyond that
>> > point, and it might help give the OS more cache, especially if work
>> > mem is configured very high to avoid multiple index scans.
> How long does that part ever take? Is there any substantial gain from this?

If you're asking about how long the dealloc takes, we're going to have 
to pay that cost anyway when the context gets destroyed/reset, no? Doing 
that sooner rather than later certainly seems like a good idea since 
we've seen that truncation can take quite some time. Might as well give 
the memory back to the OS ASAP.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Sat, Sep 3, 2016 at 8:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> The attached patch allows setting maintainance_work_mem or
> autovacuum_work_mem higher than 1GB (and be effective), by turning the
> allocation of the dead_tuples into a huge allocation.
>
> This results in fewer index scans for heavily bloated tables, and
> could be a lifesaver in many situations (in particular, the situation
> I'm living right now in production, where we don't have enough room
> for a vacuum full, and have just deleted 75% of a table to make room
> but have to rely on regular lazy vacuum to free the space).
>
> The patch also makes vacuum free the dead_tuples before starting
> truncation. It didn't seem necessary to hold onto it beyond that
> point, and it might help give the OS more cache, especially if work
> mem is configured very high to avoid multiple index scans.
>
> Tested with pgbench scale 4000 after deleting the whole
> pgbench_accounts table, seemed to work fine.

The problem with this is that we allocate the entire amount of
maintenance_work_mem even when the number of actual dead tuples turns
out to be very small.  That's not so bad if the amount of memory we're
potentially wasting is limited to ~1 GB, but it seems pretty dangerous
to remove the 1 GB limit, because somebody might have
maintenance_work_mem set to tens or hundreds of gigabytes to speed
index creation, and allocating that much space for a VACUUM that
encounters 1 dead tuple does not seem like a good plan.

What I think we need to do is make some provision to initially
allocate only a small amount of memory and then grow the allocation
later if needed.  For example, instead of having
vacrelstats->dead_tuples be declared as ItemPointer, declare it as
ItemPointer * and allocate the array progressively in segments.  I'd
actually argue that the segment size should be substantially smaller
than 1 GB, like say 64MB; there are still some people running systems
which are small enough that allocating 1 GB when we may need only 6
bytes can drive the system into OOM.
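
A rough sketch of that layout, with hypothetical names (the 64MB segment
size, the struct, and the helper are illustrative, not a settled design):

    #include "postgres.h"
    #include "storage/itemptr.h"

    #define DEAD_TUPLE_SEG_BYTES   (64 * 1024 * 1024)    /* 64MB per segment */
    #define DEAD_TUPLES_PER_SEG    (DEAD_TUPLE_SEG_BYTES / sizeof(ItemPointerData))

    typedef struct DeadTuplesArray
    {
        int          nsegs;      /* segments allocated so far */
        long         ntuples;    /* total TIDs stored */
        ItemPointer *segs;       /* array of fixed-size segments */
    } DeadTuplesArray;

    static void
    dead_tuples_append(DeadTuplesArray *dt, ItemPointer itemptr)
    {
        long    seg = dt->ntuples / DEAD_TUPLES_PER_SEG;
        long    off = dt->ntuples % DEAD_TUPLES_PER_SEG;

        if (seg >= dt->nsegs)
        {
            /*
             * Grow by one fixed-size segment.  Only the small array of
             * segment pointers is ever reallocated; the TIDs themselves
             * are never copied.
             */
            if (dt->segs == NULL)
                dt->segs = palloc(sizeof(ItemPointer));
            else
                dt->segs = repalloc(dt->segs,
                                    (dt->nsegs + 1) * sizeof(ItemPointer));
            dt->segs[dt->nsegs++] = palloc(DEAD_TUPLE_SEG_BYTES);
        }
        dt->segs[seg][off] = *itemptr;
        dt->ntuples++;
    }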

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Sun, Sep 4, 2016 at 8:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 9/4/16 1:46 AM, Simon Riggs wrote:
>>>
>>> > The patch also makes vacuum free the dead_tuples before starting
>>> > truncation. It didn't seem necessary to hold onto it beyond that
>>> > point, and it might help give the OS more cache, especially if work
>>> > mem is configured very high to avoid multiple index scans.
>>
>> How long does that part ever take? Is there any substantial gain from
>> this?
>
>
> If you're asking about how long the dealloc takes, we're going to have to
> pay that cost anyway when the context gets destroyed/reset, no? Doing that
> sooner rather than later certainly seems like a good idea since we've seen
> that truncation can take quite some time. Might as well give the memory back
> to the OS ASAP.

AFAIK, except on debug builds where it has to memset the whole thing,
the cost is constant (unrelated to the allocated block size), so it
should be rather small in this context.


On Tue, Sep 6, 2016 at 1:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Sep 3, 2016 at 8:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The attached patch allows setting maintainance_work_mem or
>> autovacuum_work_mem higher than 1GB (and be effective), by turning the
>> allocation of the dead_tuples into a huge allocation.
>>
>> This results in fewer index scans for heavily bloated tables, and
>> could be a lifesaver in many situations (in particular, the situation
>> I'm living right now in production, where we don't have enough room
>> for a vacuum full, and have just deleted 75% of a table to make room
>> but have to rely on regular lazy vacuum to free the space).
>>
>> The patch also makes vacuum free the dead_tuples before starting
>> truncation. It didn't seem necessary to hold onto it beyond that
>> point, and it might help give the OS more cache, especially if work
>> mem is configured very high to avoid multiple index scans.
>>
>> Tested with pgbench scale 4000 after deleting the whole
>> pgbench_accounts table, seemed to work fine.
>
> The problem with this is that we allocate the entire amount of
> maintenance_work_mem even when the number of actual dead tuples turns
> out to be very small.  That's not so bad if the amount of memory we're
> potentially wasting is limited to ~1 GB, but it seems pretty dangerous
> to remove the 1 GB limit, because somebody might have
> maintenance_work_mem set to tens or hundreds of gigabytes to speed
> index creation, and allocating that much space for a VACUUM that
> encounters 1 dead tuple does not seem like a good plan.
>
> What I think we need to do is make some provision to initially
> allocate only a small amount of memory and then grow the allocation
> later if needed.  For example, instead of having
> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
> ItemPointer * and allocate the array progressively in segments.  I'd
> actually argue that the segment size should be substantially smaller
> than 1 GB, like say 64MB; there are still some people running systems
> which are small enough that allocating 1 GB when we may need only 6
> bytes can drive the system into OOM.

This would, however, incur the cost of copying the whole GB-sized
chunk every time it's expanded. It wouldn't be cheap.

I've monitored the vacuum as it runs, and the OS doesn't map the whole
block unless it's touched, which it isn't until dead tuples are found.
Granted, if overcommit is disabled (as it should be), it could exhaust
the virtual address space if set very high, but it wouldn't really use
the memory unless it's needed; it would merely reserve it.

To fix that, rather than repalloc the whole thing, dead_tuples would
have to be an ItemPointer** of sorted chunks. That'd be a
significantly more complex patch, but at least it wouldn't incur the
memcpy. I could attempt that, but I don't see the difference between
vacuum and create index in this case. Both could allocate a huge chunk
of the virtual address space if maintenance_work_mem says so, both
proportional to the size of the table. I can't see how that could take
any DBA by surprise.

A sensible compromise could be dividing the maintenance_work_mem by
autovacuum_max_workers when used in autovacuum, as is done for cost
limits, to protect those who set both rather high.



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Sep 6, 2016 at 10:28 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The problem with this is that we allocate the entire amount of
>> maintenance_work_mem even when the number of actual dead tuples turns
>> out to be very small.  That's not so bad if the amount of memory we're
>> potentially wasting is limited to ~1 GB, but it seems pretty dangerous
>> to remove the 1 GB limit, because somebody might have
>> maintenance_work_mem set to tens or hundreds of gigabytes to speed
>> index creation, and allocating that much space for a VACUUM that
>> encounters 1 dead tuple does not seem like a good plan.
>>
>> What I think we need to do is make some provision to initially
>> allocate only a small amount of memory and then grow the allocation
>> later if needed.  For example, instead of having
>> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
>> ItemPointer * and allocate the array progressively in segments.  I'd
>> actually argue that the segment size should be substantially smaller
>> than 1 GB, like say 64MB; there are still some people running systems
>> which are small enough that allocating 1 GB when we may need only 6
>> bytes can drive the system into OOM.
>
> This would however incur the cost of having to copy the whole GB-sized
> chunk every time it's expanded. It woudln't be cheap.

No, I don't want to end up copying the whole array; that's what I
meant by allocating it progressively in segments.  Something like what
you go on to propose.

> I've monitored the vacuum as it runs and the OS doesn't map the whole
> block unless it's touched, which it isn't until dead tuples are found.
> Surely, if overcommit is disabled (as it should), it could exhaust the
> virtual address space if set very high, but it wouldn't really use the
> memory unless it's needed, it would merely reserve it.

Yeah, but I've seen actual breakage from exactly this issue on
customer systems even with the 1GB limit, and when we start allowing
100GB it's going to get a whole lot worse.

> To fix that, rather than repalloc the whole thing, dead_tuples would
> have to be an ItemPointer** of sorted chunks. That'd be a
> significantly more complex patch, but at least it wouldn't incur the
> memcpy.

Right, this is what I had in mind.  I don't think this is actually
very complicated, because the way we use this array is really simple.
We basically just keep appending to the array until we run out of
space, and that's not very hard to implement with an array-of-arrays.
The chunks are, in some sense, sorted, as you say, but you don't need
to do qsort() or anything like that.  You're just replacing a single
flat array with a data structure that can be grown incrementally in
fixed-size chunks.

> I could attempt that, but I don't see the difference between
> vacuum and create index in this case. Both could allocate a huge chunk
> of the virtual address space if maintainance work mem says so, both
> proportional to the size of the table. I can't see how that could take
> any DBA by surprise.

Really?  CREATE INDEX isn't going to allocate more storage space than
the size of the data actually being sorted, because tuplesort.c is
smart about that kind of thing.  But VACUUM will very happily allocate
vastly more memory than the number of dead tuples.  It is thankfully
smart enough not to allocate more storage than the number of line
pointers that could theoretically exist in a relation of the given
size, but that only helps for very small relations.  In a large
relation that divergence between the amount of storage space that
could theoretically be needed and the amount that is actually needed
is likely to be extremely high.  1 TB relation = 2^27 blocks, each of
which can contain MaxHeapTuplesPerPage dead line pointers.  On my
system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible
dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB,
but the expected number of dead line pointers is much less than that.
Even if this is a vacuum triggered by autovacuum_vacuum_scale_factor
and you're using the default of 0.2 (probably too high for such a
large table), assuming there are about 60 tuples per page (which is
what I get with pgbench -i) the table would have about 2^27 * 60 = 7.7
billion tuples of which 1.5 billion would be dead, meaning we need
about 9-10GB of space to store all of those dead tuples.  Allocating
as much as 218GB when we need 9-10GB is going to sting, and I don't
see how you will get a comparable distortion with CREATE INDEX.  I
might be missing something, though.

There's no real issue when there's only one process running on the
system at a time.  If the user set maintenance_work_mem to an amount
of memory that he can't afford to pay even once, then that's simple
misconfiguration and it's not really our problem.  The issue is that
when there are 3 or potentially more VACUUM processes running plus a
CREATE INDEX or two at the same time.  If you set maintenance_work_mem
to a value that is large enough to make the CREATE INDEX run fast, now
with your patch that is also going to cause each VACUUM process to
gobble up lots of extra memory that it probably doesn't need, and now
you may well start to get failures.  I've seen this happen even with
the current 1GB limit, though you need a pretty small system - e.g.
8GB RAM - for it to be a problem.  I think it is really really likely
to cause big problems for us if we dramatically increase that limit
without making the allocation algorithm smarter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Sep 6, 2016 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I could attempt that, but I don't see the difference between
>> vacuum and create index in this case. Both could allocate a huge chunk
>> of the virtual address space if maintainance work mem says so, both
>> proportional to the size of the table. I can't see how that could take
>> any DBA by surprise.
>
> Really?  CREATE INDEX isn't going to allocate more storage space than
> the size of the data actually being sorted, because tuplesort.c is
> smart about that kind of thing.  But VACUUM will very happily allocate
> vastly more memory than the number of dead tuples.  It is thankfully
> smart enough not to allocate more storage than the number of line
> pointers that could theoretically exist in a relation of the given
> size, but that only helps for very small relations.  In a large
> relation that divergence between the amount of storage space that
> could theoretically be needed and the amount that is actually needed
> is likely to be extremely high.  1 TB relation = 2^27 blocks, each of
> which can contain MaxHeapTuplesPerPage dead line pointers.  On my
> system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible
> dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB,
> but the expected number of dead line pointers is much less than that.
> Even if this is a vacuum triggered by autovacuum_vacuum_scale_factor
> and you're using the default of 0.2 (probably too high for such a
> large table), assuming there are about 60 tuples for page (which is
> what I get with pgbench -i) the table would have about 2^27 * 60 = 7.7
> billion tuples of which 1.5 billion would be dead, meaning we need
> about 9-10GB of space to store all of those dead tuples.  Allocating
> as much as 218GB when we need 9-10GB is going to sting, and I don't
> see how you will get a comparable distortion with CREATE INDEX.  I
> might be missing something, though.

CREATE INDEX could also allocate 218GB, you just need to index enough
columns and you'll get that.

Aside from the fact that CREATE INDEX will only allocate what is going
to be used and VACUUM will overallocate, the potential to fully
allocate the amount given is still there for both cases.

> There's no real issue when there's only one process running on the
> system at a time.  If the user set maintenance_work_mem to an amount
> of memory that he can't afford to pay even once, then that's simple
> misconfiguration and it's not really our problem.  The issue is that
> when there are 3 or potentially more VACUUM processes running plus a
> CREATE INDEX or two at the same time.  If you set maintenance_work_mem
> to a value that is large enough to make the CREATE INDEX run fast, now
> with your patch that is also going to cause each VACUUM process to
> gobble up lots of extra memory that it probably doesn't need, and now
> you may well start to get failures.  I've seen this happen even with
> the current 1GB limit, though you need a pretty small system - e.g.
> 8GB RAM - for it to be a problem.  I think it is really really likely
> to cause big problems for us if we dramatically increase that limit
> without making the allocation algorithm smarter.

OK. A pity it will invalidate all the testing already done, though (I
was almost done with the testing).

I guess I'll send the results anyway.



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Sep 6, 2016 at 11:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> CREATE INDEX could also allocate 218GB, you just need to index enough
> columns and you'll get that.
>
> Aside from the fact that CREATE INDEX will only allocate what is going
> to be used and VACUUM will overallocate, the potential to fully
> allocate the amount given is still there for both cases.

I agree with that, but I think there's a big difference between
allocating the memory only when it's needed and allocating it whether
it is needed or not.  YMMV, of course, but that's what I think....

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, but I've seen actual breakage from exactly this issue on
> customer systems even with the 1GB limit, and when we start allowing
> 100GB it's going to get a whole lot worse.

While it's not necessarily a bad idea to consider these things,
I think people are greatly overestimating the consequences of the
patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
eat 100GB of workspace.  Note the line right in front of the one
being changed:
         maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
         maxtuples = Min(maxtuples, INT_MAX);
-        maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+        maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));

Regardless of what vac_work_mem is, we aren't gonna let you have more
than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
increase from the patch as given is 12X.  Maybe that's enough to cause
bad consequences on some systems, but it's not the sort of disaster
Robert posits above.

It's also worth re-reading the lines just after this, which constrain
the allocation a whole lot more for small tables.  Robert comments:

> ...  But VACUUM will very happily allocate
> vastly more memory than the number of dead tuples.  It is thankfully
> smart enough not to allocate more storage than the number of line
> pointers that could theoretically exist in a relation of the given
> size, but that only helps for very small relations.  In a large
> relation that divergence between the amount of storage space that
> could theoretically be needed and the amount that is actually needed
> is likely to be extremely high.  1 TB relation = 2^27 blocks, each of
> which can contain MaxHeapTuplesPerPage dead line pointers.  On my
> system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible
> dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB,
> but the expected number of dead line pointers is much less than that.

If we think the expected number of dead pointers is so much less than
that, why don't we just decrease LAZY_ALLOC_TUPLES, and take a hit in
extra index vacuum cycles when we're wrong?

(Actually, what I'd be inclined to do is let it have MaxHeapTuplesPerPage
slots per page up till a few meg, and then start tailing off the
space-per-page, figuring that the law of large numbers will probably kick
in.)
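
One possible reading of that tail-off, as a sketch (the function name, the
cutoff, and the taper factor are invented here for illustration):

    /*
     * Sketch: give every page a full MaxHeapTuplesPerPage worth of slots up
     * to some small cap, then assume large tables average far fewer dead
     * line pointers per page and taper off the per-page allowance.
     */
    static long
    guess_dead_tuple_slots(BlockNumber relblocks, long work_mem_tuples)
    {
        BlockNumber full_blocks = Min(relblocks, (BlockNumber) 4096);   /* ~32MB of heap */
        long        maxtuples = (long) full_blocks * MaxHeapTuplesPerPage;

        if (relblocks > full_blocks)
            maxtuples += (long) (relblocks - full_blocks) *
                         (MaxHeapTuplesPerPage / 4);

        return Min(maxtuples, work_mem_tuples);
    }
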
        regards, tom lane



Re: Vacuum: allow usage of more than 1GB of work mem

From
Simon Riggs
Date:
On 6 September 2016 at 19:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah, but I've seen actual breakage from exactly this issue on
>> customer systems even with the 1GB limit, and when we start allowing
>> 100GB it's going to get a whole lot worse.
>
> While it's not necessarily a bad idea to consider these things,
> I think people are greatly overestimating the consequences of the
> patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
> eat 100GB of workspace.  Note the line right in front of the one
> being changed:
>
>          maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>          maxtuples = Min(maxtuples, INT_MAX);
> -        maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
> +        maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));
>
> Regardless of what vac_work_mem is, we aren't gonna let you have more
> than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
> increase from the patch as given is 12X.  Maybe that's enough to cause
> bad consequences on some systems, but it's not the sort of disaster
> Robert posits above.

Is there a reason we can't use repalloc here?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Sep 6, 2016 at 2:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah, but I've seen actual breakage from exactly this issue on
>> customer systems even with the 1GB limit, and when we start allowing
>> 100GB it's going to get a whole lot worse.
>
> While it's not necessarily a bad idea to consider these things,
> I think people are greatly overestimating the consequences of the
> patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
> eat 100GB of workspace.  Note the line right in front of the one
> being changed:
>
>          maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>          maxtuples = Min(maxtuples, INT_MAX);
> -        maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
> +        maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));
>
> Regardless of what vac_work_mem is, we aren't gonna let you have more
> than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
> increase from the patch as given is 12X.  Maybe that's enough to cause
> bad consequences on some systems, but it's not the sort of disaster
> Robert posits above.

Hmm, OK.  Yes, that is a lot less bad.  (I think it's still bad.)

> If we think the expected number of dead pointers is so much less than
> that, why don't we just decrease LAZY_ALLOC_TUPLES, and take a hit in
> extra index vacuum cycles when we're wrong?

Because that's really inefficient.  Growing the array, even with a
stupid approach that copies all of the TIDs every time, is a heck of a
lot faster than incurring an extra index vac cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Sep 6, 2016 at 2:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 6 September 2016 at 19:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Yeah, but I've seen actual breakage from exactly this issue on
>>> customer systems even with the 1GB limit, and when we start allowing
>>> 100GB it's going to get a whole lot worse.
>>
>> While it's not necessarily a bad idea to consider these things,
>> I think people are greatly overestimating the consequences of the
>> patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
>> eat 100GB of workspace.  Note the line right in front of the one
>> being changed:
>>
>>          maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>>          maxtuples = Min(maxtuples, INT_MAX);
>> -        maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
>> +        maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));
>>
>> Regardless of what vac_work_mem is, we aren't gonna let you have more
>> than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
>> increase from the patch as given is 12X.  Maybe that's enough to cause
>> bad consequences on some systems, but it's not the sort of disaster
>> Robert posits above.
>
> Is there a reason we can't use repalloc here?

There are two possible problems, either of which is necessarily fatal:

1. I expect repalloc probably works by allocating the new space,
copying from old to new, and freeing the old.  That could work out
badly if we are near the edge of the system's allocation limit.

2. It's slower than the approach proposed upthread of allocating the
array in segments.  With that approach, we never need to memcpy()
anything.

On the plus side, it's probably less code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Sep 6, 2016 at 2:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> There are two possible problems, either of which is necessarily fatal:

I meant to write "neither of which" not "either of which".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Is there a reason we can't use repalloc here?

(1) repalloc will probably copy the data.

(2) that answer doesn't excuse you from choosing a limit.

We could get around (1) by something like Robert's idea of segmented
allocation, but TBH I've seen nothing on this thread to make me think
it's necessary or would even result in any performance improvement
at all.  The bigger we make that array, the worse index-cleaning
is going to perform, and complicating the data structure will add
another hit on top of that.
        regards, tom lane



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Sep 6, 2016 at 3:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We could get around (1) by something like Robert's idea of segmented
> allocation, but TBH I've seen nothing on this thread to make me think
> it's necessary or would even result in any performance improvement
> at all.  The bigger we make that array, the worse index-cleaning
> is going to perform, and complicating the data structure will add
> another hit on top of that.

I wouldn't be so sure; I've seen cases where two binary searches were
faster than a single binary search, especially when working with
humongous arrays like this tid array, because touching fewer (memory)
pages per search pays off considerably.

I'd try before giving up on the idea.
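
A hypothetical sketch of that two-level lookup over a segmented TID array,
extending the DeadTuplesArray sketch upthread with invented per-segment
bounds (seg_last[]) and counts (seg_count[]); vac_cmp_itemptr is the existing
comparator in vacuumlazy.c:

    static bool
    dead_tuples_lookup(DeadTuplesArray *dt, ItemPointer itemptr)
    {
        int     lo = 0,
                hi = dt->nsegs - 1,
                seg = -1;

        /* level 1: binary-search the small array of per-segment upper bounds */
        while (lo <= hi)
        {
            int     mid = (lo + hi) / 2;

            if (ItemPointerCompare(itemptr, &dt->seg_last[mid]) <= 0)
            {
                seg = mid;
                hi = mid - 1;
            }
            else
                lo = mid + 1;
        }
        if (seg < 0)
            return false;       /* beyond the largest stored TID */

        /* level 2: ordinary binary search inside a single segment */
        return bsearch(itemptr, dt->segs[seg], dt->seg_count[seg],
                       sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
    }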

The test results (which I'll post in a second) do lend credence to your
expectation that making the array bigger/more complex does impact
index scan performance. It's still faster than scanning several times
though.



Re: Vacuum: allow usage of more than 1GB of work mem

From
Simon Riggs
Date:
On 6 September 2016 at 19:09, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 6, 2016 at 2:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 6 September 2016 at 19:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> Yeah, but I've seen actual breakage from exactly this issue on
>>>> customer systems even with the 1GB limit, and when we start allowing
>>>> 100GB it's going to get a whole lot worse.
>>>
>>> While it's not necessarily a bad idea to consider these things,
>>> I think people are greatly overestimating the consequences of the
>>> patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
>>> eat 100GB of workspace.  Note the line right in front of the one
>>> being changed:
>>>
>>>          maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>>>          maxtuples = Min(maxtuples, INT_MAX);
>>> -        maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
>>> +        maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));
>>>
>>> Regardless of what vac_work_mem is, we aren't gonna let you have more
>>> than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
>>> increase from the patch as given is 12X.  Maybe that's enough to cause
>>> bad consequences on some systems, but it's not the sort of disaster
>>> Robert posits above.
>>
>> Is there a reason we can't use repalloc here?
>
> There are two possible problems, either of which is necessarily fatal:
>
> 1. I expect repalloc probably works by allocating the new space,
> copying from old to new, and freeing the old.  That could work out
> badly if we are nearly the edge of the system's allocation limit.
>
> 2. It's slower than the approach proposed upthread of allocating the
> array in segments.  With that approach, we never need to memcpy()
> anything.
>
> On the plus side, it's probably less code.

Hmm, OK.

What occurs to me is that we can exactly predict how many tuples we
are going to get when we autovacuum, since we measure that and we know
what the number is when we trigger it.

So there doesn't need to be any guessing going on at all, nor do we
need it to be flexible.

My proposal now is to pass in the number of rows changed since last
vacuum and use that (+10% to be safe) as the size of the array, up to
the defined limit.

Manual VACUUM still needs to guess, so we might need a flexible
solution there, but generally we don't. We could probably estimate it
from the VM.
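
One rough way to sketch the estimate-based sizing for the manual case
(illustrative only; pgstat_fetch_stat_tabentry() and n_dead_tuples are
existing APIs, but the sizing rule itself is invented here, not the patch):

    /*
     * Sketch: cap the dead-tuples array using pgstat's dead-tuple estimate
     * plus a 10% safety margin, keeping the existing sizing as an upper
     * bound and falling back to it when no stats are available.
     */
    PgStat_StatTabEntry *tabentry;

    tabentry = pgstat_fetch_stat_tabentry(RelationGetRelid(onerel));
    if (tabentry != NULL && tabentry->n_dead_tuples > 0)
    {
        long    est_dead = (long) (tabentry->n_dead_tuples * 1.1);

        maxtuples = Min(maxtuples, Max(est_dead, MaxHeapTuplesPerPage));
    }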

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> What occurs to me is that we can exactly predict how many tuples we
> are going to get when we autovacuum, since we measure that and we know
> what the number is when we trigger it.
>
> So there doesn't need to be any guessing going on at all, nor do we
> need it to be flexible.

No, that's not really true.  A lot can change between the time it's
triggered and the time it happens, or even while it's happening.
Somebody can run a gigantic bulk delete just after we start the
VACUUM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Sep 6, 2016 at 3:45 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 September 2016 at 21:58, Claudio Freire <klaussfreire@gmail.com> wrote:
>
>>>>> How long does that part ever take? Is there any substantial gain from this?
>
>> Btw, without a further patch to prefetch pages on the backward scan
>> for truncate, however, my patience ran out before it finished
>> truncating. I haven't submitted that patch because there was an
>> identical patch in an older thread that was discussed and more or less
>> rejected since it slightly penalized SSDs.
>
> OK, thats enough context. Sorry for being forgetful on that point.
>
> Please post that new patch also.

Attached.

On Mon, Sep 5, 2016 at 5:58 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 5 September 2016 at 15:50, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>>> The patch also makes vacuum free the dead_tuples before starting
>>>>> truncation. It didn't seem necessary to hold onto it beyond that
>>>>> point, and it might help give the OS more cache, especially if work
>>>>> mem is configured very high to avoid multiple index scans.
>>>>
>>>> How long does that part ever take? Is there any substantial gain from this?
>>>>
>>>> Lets discuss that as a potential second patch.
>>>
>>> In the test case I mentioned, it takes longer than the vacuum part itself.
>>
>> Please provide a test case and timings so we can see what's happening.

Robert made a strong point for a change in the approach, so the
information below is applicable only to the old patch (to be
rewritten).

I'm sending this merely to document the testing done; it will be a
while before I can get the proposed design running and tested.

> The referenced test case is the one I mentioned on the OP:
>
> - createdb pgbench
> - pgbench -i -s 4000 pgbench
> - psql pgbench -c 'delete from pgbench_accounts;'
> - vacuumdb -v -t pgbench_accounts pgbench
>
> fsync=off, autovacuum=off, maintainance_work_mem=4GB
>
> From what I remember, it used ~2.7GB of RAM up until the truncate
> phase, where it freed it. It performed a single index scan over the
> PK.
>
> I don't remember timings, and I didn't take them, so I'll have to
> repeat the test to get them. It takes all day and makes my laptop
> unusably slow, so I'll post them later, but they're not very
> interesting. The only interesting bit is that it does a single index
> scan instead of several, which on TB-or-more tables it's kinda nice.


So, the test results below:

During setup, perhaps useful as context, the delete took 52m 50s of real
time (measured with time psql pgbench -c 'delete from pgbench_accounts;').

During the delete, my I/O was on average like the following, which
should give an indication of what my I/O subsystem is capable of (not
much, granted):

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.47     5.27   35.53   77.42    17.58    42.95  1097.51   145.22 1295.23   33.47 1874.36   8.85 100.00

Since it's a 5k RPM laptop drive, it's rather slow on IOPS, and since
I'm using the defaults for shared buffers and checkpoints, write
throughput isn't stellar either. But that's not the point of the test
anyway; it's just for context.

The hardware is an HP Envy laptop with a 1TB 5.4k RPM hard drive, 12GB
RAM, and a Core i7-4722HQ, with no weird performance tweaking of any
kind (i.e., CPU scaling left intact). The system was not dedicated of course,
being a laptop, but it had little else going on while the test was
running. Given the size of the test, I don't believe there's any
chance concurrent activity could invalidate the results.

The timing for setup was comparable with both versions (patched and
unpatched), so I'm reporting the patched times only.


The vacuum phase:

patched:

$ vacuumdb -v -t pgbench_accounts pgbench
INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 400000000 row versions
DETAIL:  CPU 12.46s/48.76u sec elapsed 566.47 sec.
INFO:  "pgbench_accounts": removed 400000000 row versions in 6557378 pages
DETAIL:  CPU 56.68s/28.90u sec elapsed 1872.76 sec.
INFO:  index "pgbench_accounts_pkey" now contains 0 row versions in 1096762 pages
DETAIL:  400000000 index row versions were removed.
1092896 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.47 sec.
INFO:  "pgbench_accounts": found 400000000 removable, 0 nonremovable
row versions in 6557378 out of 6557378 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 129.24s/127.24u sec elapsed 3877.13 sec.
INFO:  "pgbench_accounts": truncated 6557378 to 0 pages
DETAIL:  CPU 34.88s/7.91u sec elapsed 1645.90 sec.

Total elapsed time: ~92 minutes

I/O during initial heap scan:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               1.52    99.78   72.63   62.47    31.94    33.22   987.80   146.71 1096.29   25.39 2341.48   7.40 100.00

Index scan:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               7.08     3.87   55.18   59.87    17.06    31.83   870.33   146.61 1243.34   31.42 2360.44   8.69 100.00

Final heap scan:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.78     8.65   65.32   57.32    31.50    32.96  1076.56   152.22 1928.67 1410.63 2519.01   8.15 100.00

Truncate (with prefetch):

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda             159.67     0.87 3720.03    2.82    30.31     0.12    16.74    19.11    5.13    4.95  242.60   0.27  99.23

Without prefetch, rMB/s during truncation varies between 4MB/s and
6MB/s, so it's on average 6 times slower, meaning it would take over 3
hours.

Peak memory used: 2369MB RSS, 4260MB VIRT  (source: top)

Unpatched + prefetch (same config, effective work mem 1GB due to
non-huge allocation limit):

$ vacuumdb -v -t pgbench_accounts pgbench
INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL:  CPU 5.88s/53.77u sec elapsed 263.63 sec.
INFO:  "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL:  CPU 22.28s/12.94u sec elapsed 757.45 sec.
INFO:  scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL:  CPU 7.44s/31.28u sec elapsed 282.41 sec.
INFO:  "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL:  CPU 22.24s/13.30u sec elapsed 806.54 sec.
INFO:  scanned index "pgbench_accounts_pkey" to remove 42086526 row versions
DETAIL:  CPU 4.30s/5.83u sec elapsed 170.30 sec.
INFO:  "pgbench_accounts": removed 42086526 row versions in 689944 pages
DETAIL:  CPU 3.35s/3.23u sec elapsed 126.22 sec.
INFO:  index "pgbench_accounts_pkey" now contains 0 row versions in 1096762 pages
DETAIL:  400000000 index row versions were removed.
1096351 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.40 sec.
INFO:  "pgbench_accounts": found 400000000 removable, 0 nonremovable
row versions in 6557378 out of 6557378 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 123.82s/183.76u sec elapsed 4071.54 sec.
INFO:  "pgbench_accounts": truncated 6557378 to 0 pages
DETAIL:  CPU 40.36s/7.72u sec elapsed 1648.22 sec.

Total elapsed time:  ~95m

I/O during initial heap scan:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               1.48    32.53   66.10   60.50    31.95    34.88  1081.06   149.20 1175.78   25.44 2432.59   8.02 101.59

First index scan:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               1.17    14.95   43.85   70.07    19.65    40.18  1075.57   145.98 1278.39   31.86 2058.51   8.78 100.00

Final index scan:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              17.12     1.50  169.85    2.28    68.33     0.67   820.95   158.32  312.00   28.14 21426.95   5.81 100.00

Truncation:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda             142.93     1.23 3444.70    4.65    28.16     0.36    16.93    18.52    5.37    5.25   91.17   0.29  99.22

Peak memory used is 1135MB RSS and 1188MB VIRT

Comparing:

Time reportedly spent scanning indexes: 716.34 unpatched, 566.47 patched
Time reportedly spent scanning heap: 1690.21 unpatched, 1872.76 patched
Total vacuum scan as reported: 4071.54 unpatched, 3877.13 patched

Honestly, I didn't expect it to be such a close call. I believe the key
reason is the speedup the final index scan got from not having to
delete so many tuples. Clearly, having to interleave reads and writes
is stressing my HD, and the last index scan, having to write less, was
thus faster.

I don't believe this would have happened if the index hadn't been
pristine and in almost physical (heap) order, so I'd expect real-world
cases (with properly aged, shuffled and bloated indexes) to show a
more pronounced difference, as would runs using a cost limit, which
artificially caps the I/O rate vacuum can reach.

Clearly the patch is of use when I/O is the limiting factor, either
due to vacuum cost limits or due to the I/O subsystem being the
bottleneck, as was the case in the test above. Since more work mem
means a slower lookup of the dead_tuples array, not only because of
the extra comparisons but also because of poorer cache locality, I
don't believe it will improve the runtime of CPU-bound cases, but it
should at least generate less WAL, since that's another benefit of
scanning the indexes fewer times (increased WAL rates during vacuum
are another problem we regularly face in our production setup).

Given the I/O subsystem on my test machine isn't able to produce a
CPU-bound test case for the amount of dead_tuples involved in
stressing the patch, I cannot confirm the above statement, but it
should be evident given the implementation.

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Simon Riggs
Date:
On 6 September 2016 at 19:23, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> What occurs to me is that we can exactly predict how many tuples we
>> are going to get when we autovacuum, since we measure that and we know
>> what the number is when we trigger it.
>>
>> So there doesn't need to be any guessing going on at all, nor do we
>> need it to be flexible.
>
> No, that's not really true.  A lot can change between the time it's
> triggered and the time it happens, or even while it's happening.
> Somebody can run a gigantic bulk delete just after we start the
> VACUUM.

Which wouldn't be removed by the VACUUM, so can be ignored.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Sep 6, 2016 at 2:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 6 September 2016 at 19:23, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> What occurs to me is that we can exactly predict how many tuples we
>>> are going to get when we autovacuum, since we measure that and we know
>>> what the number is when we trigger it.
>>>
>>> So there doesn't need to be any guessing going on at all, nor do we
>>> need it to be flexible.
>>
>> No, that's not really true.  A lot can change between the time it's
>> triggered and the time it happens, or even while it's happening.
>> Somebody can run a gigantic bulk delete just after we start the
>> VACUUM.
>
> Which wouldn't be removed by the VACUUM, so can be ignored.

OK, true.  But I still think it's very unlikely that we can calculate
an exact count of how many dead tuples we might run into.  I think we
shouldn't rely on the stats collector to be perfectly correct anyway -
for one thing, you can turn it off - and instead cope with the
uncertainty.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On 6 September 2016 at 19:23, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> What occurs to me is that we can exactly predict how many tuples we
>>> are going to get when we autovacuum, since we measure that and we know
>>> what the number is when we trigger it.
>>> So there doesn't need to be any guessing going on at all, nor do we
>>> need it to be flexible.

>> No, that's not really true.  A lot can change between the time it's
>> triggered and the time it happens, or even while it's happening.
>> Somebody can run a gigantic bulk delete just after we start the
>> VACUUM.

> Which wouldn't be removed by the VACUUM, so can be ignored.

(1) If the delete commits just before the vacuum starts, it may be
removable.  I think you're nuts to imagine there are no race conditions
here.

(2) Stats from the stats collector never have been, and likely never will
be, anything but approximate.  That goes double for dead-tuple counts,
which are inaccurate even as sent from backends, never mind the multiple
ways that the collector might lose the counts.

The idea of looking to the stats to *guess* about how many tuples are
removable doesn't seem bad at all.  But imagining that that's going to be
exact is folly of the first magnitude.
        regards, tom lane



Re: Vacuum: allow usage of more than 1GB of work mem

From
Simon Riggs
Date:
On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The idea of looking to the stats to *guess* about how many tuples are
> removable doesn't seem bad at all.  But imagining that that's going to be
> exact is folly of the first magnitude.

Yes.  Bear in mind I had already referred to allowing +10% to be safe,
so I think we agree that a reasonably accurate, yet imprecise
calculation is possible in most cases.

If a recent transaction has committed, we will see both committed dead
rows and stats to show they exist. I'm sure there are corner cases and
race conditions where a major effect (greater than 10%) could occur,
in which case we run the index scan more than once, just as we do now.

The attached patch raises the limits as suggested by Claudio, allowing
for larger memory allocations if possible, yet limits the allocation
for larger tables based on the estimate gained from pg_stats, while
adding 10% for caution.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Greg Stark
Date:
On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> The idea of looking to the stats to *guess* about how many tuples are
>> removable doesn't seem bad at all.  But imagining that that's going to be
>> exact is folly of the first magnitude.
>
> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
> so I think we agree that a reasonably accurate, yet imprecise
> calculation is possible in most cases.

That would all be well and good if it weren't trivial to do what
Robert suggested. This is just a large unsorted list that we need to
iterate through. Just allocate chunks of a few megabytes and, when
one fills up, allocate a new chunk and keep going. There's no need to
get tricky with estimates and resizing and whatever.

-- 
greg



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>> The idea of looking to the stats to *guess* about how many tuples are
>>> removable doesn't seem bad at all.  But imagining that that's going to be
>>> exact is folly of the first magnitude.
>>
>> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
>> so I think we agree that a reasonably accurate, yet imprecise
>> calculation is possible in most cases.
>
> That would all be well and good if it weren't trivial to do what
> Robert suggested. This is just a large unsorted list that we need to
> iterate throught. Just allocate chunks of a few megabytes and when
> it's full allocate a new chunk and keep going. There's no need to get
> tricky with estimates and resizing and whatever.

I agree. While the idea of estimating the right size sounds promising
a priori, considering the estimate can go wrong and over or
underallocate quite severely, the risks outweigh the benefits when you
consider the alternative of a dynamic allocation strategy.

Unless the dynamic strategy has a bigger CPU impact than expected, I
believe it's a superior approach.



Re: Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Wed, Sep 7, 2016 at 2:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 6, 2016 at 10:28 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> The problem with this is that we allocate the entire amount of
>>> maintenance_work_mem even when the number of actual dead tuples turns
>>> out to be very small.  That's not so bad if the amount of memory we're
>>> potentially wasting is limited to ~1 GB, but it seems pretty dangerous
>>> to remove the 1 GB limit, because somebody might have
>>> maintenance_work_mem set to tens or hundreds of gigabytes to speed
>>> index creation, and allocating that much space for a VACUUM that
>>> encounters 1 dead tuple does not seem like a good plan.
>>>
>>> What I think we need to do is make some provision to initially
>>> allocate only a small amount of memory and then grow the allocation
>>> later if needed.  For example, instead of having
>>> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
>>> ItemPointer * and allocate the array progressively in segments.  I'd
>>> actually argue that the segment size should be substantially smaller
>>> than 1 GB, like say 64MB; there are still some people running systems
>>> which are small enough that allocating 1 GB when we may need only 6
>>> bytes can drive the system into OOM.
>>
>> This would however incur the cost of having to copy the whole GB-sized
>> chunk every time it's expanded. It woudln't be cheap.
>
> No, I don't want to end up copying the whole array; that's what I
> meant by allocating it progressively in segments.  Something like what
> you go on to propose.
>
>> I've monitored the vacuum as it runs and the OS doesn't map the whole
>> block unless it's touched, which it isn't until dead tuples are found.
>> Surely, if overcommit is disabled (as it should), it could exhaust the
>> virtual address space if set very high, but it wouldn't really use the
>> memory unless it's needed, it would merely reserve it.
>
> Yeah, but I've seen actual breakage from exactly this issue on
> customer systems even with the 1GB limit, and when we start allowing
> 100GB it's going to get a whole lot worse.
>
>> To fix that, rather than repalloc the whole thing, dead_tuples would
>> have to be an ItemPointer** of sorted chunks. That'd be a
>> significantly more complex patch, but at least it wouldn't incur the
>> memcpy.
>
> Right, this is what I had in mind.  I don't think this is actually
> very complicated, because the way we use this array is really simple.
> We basically just keep appending to the array until we run out of
> space, and that's not very hard to implement with an array-of-arrays.
> The chunks are, in some sense, sorted, as you say, but you don't need
> to do qsort() or anything like that.  You're just replacing a single
> flat array with a data structure that can be grown incrementally in
> fixed-size chunks.
>

If we replaced dead_tuples with an array-of-arrays, wouldn't there be
a negative performance impact on lazy_tid_reap()?
As more chunks are added, that performance would decrease.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Vacuum: allow usage of more than 1GB of work mem

From
Jim Nasby
Date:
On 9/8/16 3:48 AM, Masahiko Sawada wrote:
> If we replaced dead_tuples with an array-of-array, isn't there
> negative performance impact for lazy_tid_reap()?
> As chunk is added, that performance would be decrease.

Yes, it certainly would, as you'd have to do 2 binary searches. I'm not 
sure how much that matters though; presumably the index scans are 
normally IO-bound?

Another option would be to use the size estimation ideas others have 
mentioned to create one array. If the estimates prove to be wrong you 
could then create a single additional segment; by that point you should 
have a better idea of how far off the original estimate was. That means 
the added search cost would only be a compare and a second pointer redirect.

Something else that occurred to me... AFAIK the only reason we don't 
support syncscan with VACUUM is because it would require sorting the TID 
list. If we just added a second TID list we would be able to support 
syncscan, swapping over to the 'low' list when we hit the end of the 
relation.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>> The idea of looking to the stats to *guess* about how many tuples are
>>> removable doesn't seem bad at all.  But imagining that that's going to be
>>> exact is folly of the first magnitude.
>>
>> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
>> so I think we agree that a reasonably accurate, yet imprecise
>> calculation is possible in most cases.
>
> That would all be well and good if it weren't trivial to do what
> Robert suggested. This is just a large unsorted list that we need to
> iterate throught. Just allocate chunks of a few megabytes and when
> it's full allocate a new chunk and keep going. There's no need to get
> tricky with estimates and resizing and whatever.

I agree. While the idea of estimating the right size sounds promising
a priori, considering the estimate can go wrong and over or
underallocate quite severely, the risks outweigh the benefits when you
consider the alternative of a dynamic allocation strategy.

Unless the dynamic strategy has a bigger CPU impact than expected, I
believe it's a superior approach.


How about a completely different representation for the TID array? This is probably not new, but I couldn't find whether the exact same idea was discussed before. I also think it's somewhat orthogonal to what we are trying to do here, and will probably be a bigger change. But I thought I'd mention it since we are on the topic.

What I propose is to use a simple bitmap to represent the tuples. If a tuple at <block, offset> is dead then the corresponding bit in the bitmap is set. So clearly the searches through dead tuples are O(1) operations, important for very large tables and large arrays.

The challenge really is that a heap page can theoretically have MaxOffsetNumber line pointers (or, to be precise, the maximum possible offset number). For an 8K block, that comes to about 2048. Having so many bits per page is neither practical nor optimal. But in practice the largest offset on a heap page should not be significantly greater than MaxHeapTuplesPerPage, which is a more reasonable value of 291 on my machine. Again, that's with zero-sized tuples; for real-life large tables with much wider tuples, the number may go down even further.

So we cap the offsets represented in the bitmap to some realistic value, computed by looking at page density and then multiplying it by a small factor (not more than two) to take into account LP_DEAD and LP_REDIRECT line pointers. That should cover the large majority of the dead tuples in the table, but we then keep an overflow area to record tuples beyond the per-page limit. The search routine will do a direct lookup for offsets below the limit and search the sorted overflow area for offsets beyond it.

For example, for a table with 60-byte-wide tuples (including the 24-byte header), each page can hold approximately 8192/60 = 136 tuples. Say we provision for 136*2 = 272 bits per page, i.e. 34 bytes per page for the bitmap. The first 272 offsets in every page are represented in the bitmap and anything greater goes to the overflow region. On the other hand, the current representation will need about 16 bytes per page assuming 2% dead tuples, 40 bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead tuples. So the bitmap will take more space for small tuples or when vacuum is run very aggressively, both of which seem unlikely for very large tables. Of course the calculation does not take into account the space needed by the overflow area, but I expect that to be small.
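
As a rough illustration of what the lookup side of this could look like
(not patch code; the struct layout and names are invented), offsets below
the cap become a direct bit test and anything larger falls back to a
binary search of the sorted overflow area:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* simplified stand-ins for the PostgreSQL types */
typedef uint16_t OffsetNumber;

typedef struct DeadTupleMap
{
    int         bits_per_page;  /* cap, e.g. twice the estimated tuples per page */
    uint8_t    *bitmap;         /* nblocks * bits_per_page bits, one bit per offset */
    /* sorted (block, offset) pairs whose offset exceeded bits_per_page */
    struct { BlockNumber blk; OffsetNumber off; } *overflow;
    int         noverflow;
} DeadTupleMap;

/* O(1) lookup for offsets under the cap, binary search in the overflow
 * area otherwise. */
static bool
dead_tuple_lookup(const DeadTupleMap *map, BlockNumber blk, OffsetNumber off)
{
    if (off <= map->bits_per_page)
    {
        uint64_t bit = (uint64_t) blk * map->bits_per_page + (off - 1);

        return (map->bitmap[bit / 8] >> (bit % 8)) & 1;
    }
    else
    {
        int lo = 0, hi = map->noverflow - 1;

        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;

            if (map->overflow[mid].blk == blk && map->overflow[mid].off == off)
                return true;
            if (map->overflow[mid].blk < blk ||
                (map->overflow[mid].blk == blk && map->overflow[mid].off < off))
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return false;
    }
}

As long as the cap covers almost all real offsets, nearly every probe stays
O(1) and the overflow search hardly ever runs.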

I guess we can make a choice between two representations at the start looking at various table stats. We can also be smart and change from bitmap to traditional representation as we scan the table and see many more tuples in the overflow region than we provisioned for. There will be some challenges in converting representation mid-way, especially in terms of memory allocation, but I think those can be sorted out if we think that the idea has merit.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both seems unlikely for very large tables. Of course
> the calculation does not take into account the space needed by the overflow
> area, but I expect that too be small.

I thought about something like this, but it could be extremely
inefficient for mostly frozen tables, since the bitmap cannot account
for frozen pages without losing the O(1) lookup characteristic



Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Thu, Sep 8, 2016 at 8:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both seems unlikely for very large tables. Of course
> the calculation does not take into account the space needed by the overflow
> area, but I expect that too be small.

I thought about something like this, but it could be extremely
inefficient for mostly frozen tables, since the bitmap cannot account
for frozen pages without losing the O(1) lookup characteristic

Well, that's correct. But I thought the whole point is the case where there are a large number of dead tuples, which requires a lot of memory. If my math above was correct, then even at 5% dead tuples the bitmap representation will consume approximately the same memory but provide O(1) search time.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Thu, Sep 8, 2016 at 11:54 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire <klaussfreire@gmail.com>
> wrote:
>>
>> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark <stark@mit.edu> wrote:
>> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon@2ndquadrant.com>
>> > wrote:
>> >> On 6 September 2016 at 19:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >>
>> >>> The idea of looking to the stats to *guess* about how many tuples are
>> >>> removable doesn't seem bad at all.  But imagining that that's going to
>> >>> be
>> >>> exact is folly of the first magnitude.
>> >>
>> >> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
>> >> so I think we agree that a reasonably accurate, yet imprecise
>> >> calculation is possible in most cases.
>> >
>> > That would all be well and good if it weren't trivial to do what
>> > Robert suggested. This is just a large unsorted list that we need to
>> > iterate throught. Just allocate chunks of a few megabytes and when
>> > it's full allocate a new chunk and keep going. There's no need to get
>> > tricky with estimates and resizing and whatever.
>>
>> I agree. While the idea of estimating the right size sounds promising
>> a priori, considering the estimate can go wrong and over or
>> underallocate quite severely, the risks outweigh the benefits when you
>> consider the alternative of a dynamic allocation strategy.
>>
>> Unless the dynamic strategy has a bigger CPU impact than expected, I
>> believe it's a superior approach.
>>
>
> How about a completely different representation for the TID array? Now this
> is probably not something new, but I couldn't find if the exact same idea
> was discussed before. I also think it's somewhat orthogonal to what we are
> trying to do here, and will probably be a bigger change. But I thought I'll
> mention since we are at the topic.
>
> What I propose is to use a simple bitmap to represent the tuples. If a tuple
> at <block, offset> is dead then the corresponding bit in the bitmap is set.
> So clearly the searches through dead tuples are O(1) operations, important
> for very large tables and large arrays.
>
> Challenge really is that a heap page can theoretically have MaxOffsetNumber
> of line pointers (or to be precise maximum possible offset number). For a 8K
> block, that comes be about 2048. Having so many bits per page is neither
> practical nor optimal. But in practice the largest offset on a heap page
> should not be significantly greater than MaxHeapTuplesPerPage, which is a
> more reasonable value of 291 on my machine. Again, that's with zero sized
> tuple and for real life large tables, with much wider tuples, the number may
> go down even further.
>
> So we cap the offsets represented in the bitmap to some realistic value,
> computed by looking at page density and then multiplying it by a small
> factor (not more than two) to take into account LP_DEAD and LP_REDIRECT line
> pointers. That should practically represent majority of the dead tuples in
> the table, but we then keep an overflow area to record tuples beyond the
> limit set for per page. The search routine will do a direct lookup for
> offsets less than the limit and search in the sorted overflow area for
> offsets beyond the limit.
>
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both seems unlikely for very large tables. Of course
> the calculation does not take into account the space needed by the overflow
> area, but I expect that too be small.
>
> I guess we can make a choice between two representations at the start
> looking at various table stats. We can also be smart and change from bitmap
> to traditional representation as we scan the table and see many more tuples
> in the overflow region than we provisioned for. There will be some
> challenges in converting representation mid-way, especially in terms of
> memory allocation, but I think those can be sorted out if we think that the
> idea has merit.
>

Making it possible for vacuum to choose between two data representations
sounds good.
I implemented a patch that changed the dead tuple representation to a bitmap before.
I will measure the performance of the bitmap representation again and post the results.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Thu, Sep 8, 2016 at 11:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:


Making the vacuum possible to choose between two data representations
sounds good.
I implemented the patch that changes dead tuple representation to bitmap before.
I will measure the performance of bitmap representation again and post them.

Sounds great! I haven't seen your patch, but what I would suggest is to compute page density (D) = relpages/(dead+live tuples) and experiment with bitmaps of sizes D to 2D bits per page. May I also suggest that, instead of putting effort into implementing the overflow area, you just count how many dead TIDs would fall into the overflow area for a given choice of bitmap size.

It might be a good idea to experiment with different vacuum scale factor, varying between 2% to 20% (may be 2, 5, 10, 20). You can probably run a longish pgbench test on a large table and then save the data directory for repeated experiments, although I'm not sure if pgbench will be a good choice because HOT will prevent accumulation of dead pointers, in which case you may try adding another index on abalance column.

It'll be worth measuring the memory consumption of both representations as well as the performance implications for index vacuum. I don't expect to see any major difference in heap scans for either.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Fri, Sep 9, 2016 at 12:33 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> On Thu, Sep 8, 2016 at 11:40 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>>
>>
>> Making the vacuum possible to choose between two data representations
>> sounds good.
>> I implemented the patch that changes dead tuple representation to bitmap
>> before.
>> I will measure the performance of bitmap representation again and post
>> them.
>
>
> Sounds great! I haven't seen your patch, but what I would suggest is to
> compute page density (D) = relpages/(dead+live tuples) and experiment with
> bitmap of sizes of D to 2D bits per page. May I also suggest that instead of
> putting in efforts in implementing the overflow area,  just count how many
> dead TIDs would fall under overflow area for a given choice of bitmap size.
>

Isn't that formula "page density (D) = (dead+live tuples)/relpages"?

> It might be a good idea to experiment with different vacuum scale factor,
> varying between 2% to 20% (may be 2, 5, 10, 20). You can probably run a
> longish pgbench test on a large table and then save the data directory for
> repeated experiments, although I'm not sure if pgbench will be a good choice
> because HOT will prevent accumulation of dead pointers, in which case you
> may try adding another index on abalance column.

Thank you, I will experiment with this.

>
> It'll be worth measuring memory consumption of both representations as well
> as performance implications on index vacuum. I don't expect to see any major
> difference in either heap scans.
>

Yeah, it would be effective for index vacuum speed and for the number
of index vacuum executions.

The attached PoC patch changes the representation of dead tuple locations
to a hashmap holding a per-block tuple bitmap.
Each hashmap entry consists of a block number and the TID bitmap of the
corresponding block, with the block number as the hash key.
The current implementation of this patch is not smart yet because each
hashmap entry allocates its tuple bitmap with a fixed size
(LAZY_ALLOC_TUPLES), so each hash entry can store up to
LAZY_ALLOC_TUPLES (291 if the block size is 8kB) tuples.
In cases where one block can store only a few dozen tuples, most of the
bits would be wasted.
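
For reference, each entry in this PoC is roughly shaped like the following
simplified sketch (not the patch code itself); the fixed-width bitmap is
what wastes bits on sparsely filled blocks:

#include <stdint.h>

typedef uint32_t BlockNumber;          /* simplified stand-in */

#define LAZY_ALLOC_TUPLES 291          /* MaxHeapTuplesPerPage for 8kB blocks */

/* One hash entry per heap block: the block number is the hash key and the
 * bitmap has one bit per possible offset.  At ~37 bytes per entry, a block
 * that holds only a few dozen tuples leaves most of those bits unused. */
typedef struct DeadTupleEntry
{
    BlockNumber blockno;                                  /* hash key */
    uint8_t     dead_bits[(LAZY_ALLOC_TUPLES + 7) / 8];   /* 37 bytes */
} DeadTupleEntry;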

After improving this patch as you suggested, I will measure the performance benefit.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached PoC patch changes the representation of dead tuple locations
> to the hashmap having tuple bitmap.
> The one hashmap entry consists of the block number and the TID bitmap
> of corresponding block, and the block number is the hash key of
> hashmap.
> Current implementation of this patch is not smart yet because each
> hashmap entry allocates the tuple bitmap with fixed
> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
> In case where one block can store only the several tens tuples, the
> most bits are would be waste.
>
> After improved this patch as you suggested, I will measure performance benefit.

We also need to consider the amount of memory gets used.  What I
proposed - replacing the array with an array of arrays - would not
increase memory utilization significantly.  I don't think it would
have much impact on CPU utilization either.  It would require
replacing the call to bsearch() in lazy_heap_reaptid() with an
open-coded implementation of bsearch, or with one bsearch to find the
chunk and another to find the TID within the chunk, but that shouldn't
be very expensive.  For one thing, if the array chunks are around the
size I proposed (64MB), you've got more than ten million tuples per
chunk, so you can't have very many chunks unless your table is both
really large and possessed of quite a bit of dead stuff.

Now, if I'm reading it correctly, this patch allocates a 132-byte
structure for every page with at least one dead tuple.  In the worst
case where there's just one dead tuple per page, that's a 20x
regression in memory usage.  Actually, it's more like 40x, because
AllocSetAlloc rounds small allocation sizes up to the next-higher
power of two, which really stings for a 132-byte allocation, and then
adds a 16-byte header to each chunk.  But even 20x is clearly not
good.  There are going to be lots of real-world cases where this uses
way more memory to track the same number of dead tuples, and I'm
guessing that benchmarking is going to reveal that it's not faster,
either.

I think it's probably wrong to worry that an array-of-arrays is going
to be meaningfully slower than a single array here.  It's basically
costing you some small number of additional memory references per
tuple, which I suspect isn't all that relevant for a bulk operation
that does I/O, writes WAL, locks buffers, etc.  But if it is relevant,
then I think there are other ways to buy that performance back which
are likely to be more memory efficient than converting this to use a
hash table.  For example, we could keep a bitmap with one bit per K
pages.  If the bit is set, there is at least 1 dead tuple on that
page; if clear, there are none.  When we see an index tuple, we
consult the bitmap to determine whether we need to search the TID
list.  We select K to be the smallest power of 2 such that the bitmap
uses less memory than some threshold, perhaps 64kB.  Assuming that
updates and deletes to the table have some locality, we should be able
to skip a large percentage of the TID searches with a probe into this
very compact bitmap.  Note that we can set K = 1 for tables up to 4GB
in size, and even a 1TB table only needs K = 256.  Odds are very good
that a 1TB table being vacuumed has many 256-page ranges containing no
dead tuples at all ... and if this proves to be false and the dead
tuples are scattered uniformly throughout the table, then you should
probably be more worried about the fact that you're dumping a bare
minimum of 4GB of random I/O on your hapless disk controller than
about how efficient the TID search is.
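
To illustrate the shape of that filter (invented names; the 64kB budget
and the power-of-two K follow the description above), a sketch might be:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;           /* simplified stand-in */

#define RANGE_BITMAP_BYTES (64 * 1024)  /* memory budget for the coarse bitmap */

typedef struct PageRangeFilter
{
    uint64_t    pages_per_bit;          /* K: smallest power of 2 fitting the budget */
    uint8_t     bits[RANGE_BITMAP_BYTES];
} PageRangeFilter;

/* Pick K so that ceil(nblocks / K) bits fit within the 64kB budget. */
static void
range_filter_init(PageRangeFilter *f, BlockNumber nblocks)
{
    f->pages_per_bit = 1;
    while ((nblocks + f->pages_per_bit - 1) / f->pages_per_bit >
           RANGE_BITMAP_BYTES * 8)
        f->pages_per_bit *= 2;
    memset(f->bits, 0, sizeof(f->bits));
}

/* Mark the K-page range containing blk as having at least one dead tuple. */
static void
range_filter_set(PageRangeFilter *f, BlockNumber blk)
{
    uint64_t bit = blk / f->pages_per_bit;

    f->bits[bit / 8] |= 1 << (bit % 8);
}

/* If this returns false, the K-page range containing blk has no dead
 * tuples, so the full TID-list search can be skipped entirely. */
static bool
range_filter_maybe_dead(const PageRangeFilter *f, BlockNumber blk)
{
    uint64_t bit = blk / f->pages_per_bit;

    return (f->bits[bit / 8] >> (bit % 8)) & 1;
}

The idea would be to consult this before each TID lookup during the index
scan and only fall through to the full search on a hit.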

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Peter Geoghegan
Date:
On Tue, Sep 13, 2016 at 11:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think it's probably wrong to worry that an array-of-arrays is going
> to be meaningfully slower than a single array here.  It's basically
> costing you some small number of additional memory references per
> tuple, which I suspect isn't all that relevant for a bulk operation
> that does I/O, writes WAL, locks buffers, etc.

This analysis makes perfect sense to me.


-- 
Peter Geoghegan



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Sep 13, 2016 at 3:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached PoC patch changes the representation of dead tuple locations
>> to the hashmap having tuple bitmap.
>> The one hashmap entry consists of the block number and the TID bitmap
>> of corresponding block, and the block number is the hash key of
>> hashmap.
>> Current implementation of this patch is not smart yet because each
>> hashmap entry allocates the tuple bitmap with fixed
>> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
>> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
>> In case where one block can store only the several tens tuples, the
>> most bits are would be waste.
>>
>> After improved this patch as you suggested, I will measure performance benefit.
>
> We also need to consider the amount of memory gets used.  What I
> proposed - replacing the array with an array of arrays - would not
> increase memory utilization significantly.  I don't think it would
> have much impact on CPU utilization either.

I've finished writing that patch, I'm in the process of testing its CPU impact.

First test seemed to hint at a 40% increase in CPU usage, which seems
rather steep compared to what I expected, so I'm trying to rule out
some methodology error here.

> It would require
> replacing the call to bsearch() in lazy_heap_reaptid() with an
> open-coded implementation of bsearch, or with one bsearch to find the
> chunk and another to find the TID within the chunk, but that shouldn't
> be very expensive.

I did a linear search to find the chunk, with exponentially growing
chunks, and then a bsearch to find the item inside the chunk.

With the typical number of segments and given the 12GB limit, the
segment array size is well within the range that favors linear search.
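
For illustration, that lookup shape could be sketched as below (invented
names; it assumes each segment's TIDs are already in heap order, which
they are because vacuum collects them while scanning the heap
sequentially):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t block; uint16_t offset; } ItemPointerData;  /* stand-in */

typedef struct DeadTupleSegment
{
    size_t           ntuples;
    ItemPointerData *tuples;        /* sorted: vacuum appends TIDs in heap order */
} DeadTupleSegment;

typedef struct DeadTupleStore
{
    int               nsegments;    /* small: segments grow exponentially in size */
    DeadTupleSegment *segments;
} DeadTupleStore;

static int
tid_cmp(const void *a, const void *b)
{
    const ItemPointerData *x = a, *y = b;

    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    if (x->offset != y->offset)
        return x->offset < y->offset ? -1 : 1;
    return 0;
}

/* Linear scan over the (few) segments to find the one that could contain
 * the TID, then bsearch inside it. */
static bool
dead_tuple_lookup(const DeadTupleStore *store, ItemPointerData tid)
{
    for (int i = 0; i < store->nsegments; i++)
    {
        const DeadTupleSegment *seg = &store->segments[i];

        if (seg->ntuples == 0)
            continue;
        if (tid_cmp(&tid, &seg->tuples[seg->ntuples - 1]) > 0)
            continue;               /* past this segment's last TID; try the next */
        return bsearch(&tid, seg->tuples, seg->ntuples,
                       sizeof(ItemPointerData), tid_cmp) != NULL;
    }
    return false;
}

Because the segments are ordered and few in number, the linear scan over
the directory is cheap, and the per-segment bsearch is the same cost the
flat array already paid.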

> For example, we could keep a bitmap with one bit per K
> pages.  If the bit is set, there is at least 1 dead tuple on that
> page; if clear, there are none.  When we see an index tuple, we
> consult the bitmap to determine whether we need to search the TID
> list.  We select K to be the smallest power of 2 such that the bitmap
> uses less memory than some threshold, perhaps 64kB.

I've been pondering something like that, but that's an optimization
that's quite orthogonal to the multiarray stuff.

>  Assuming that
> updates and deletes to the table have some locality, we should be able
> to skip a large percentage of the TID searches with a probe into this
> very compact bitmap.

I don't think you can assume locality



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I've finished writing that patch, I'm in the process of testing its CPU impact.
>
> First test seemed to hint at a 40% increase in CPU usage, which seems
> rather steep compared to what I expected, so I'm trying to rule out
> some methodology error here.

Hmm, wow.  That's pretty steep.  Maybe lazy_heap_reaptid() is hotter
than I think it is, but even if it accounts for 10% of total CPU usage
within a vacuum, which seems like an awful lot, you'd have to make it
4x as expensive, which also seems like an awful lot.

>> It would require
>> replacing the call to bsearch() in lazy_heap_reaptid() with an
>> open-coded implementation of bsearch, or with one bsearch to find the
>> chunk and another to find the TID within the chunk, but that shouldn't
>> be very expensive.
>
> I did a linear search to find the chunk, with exponentially growing
> chunks, and then a bsearch to find the item inside the chunk.
>
> With the typical number of segments and given the 12GB limit, the
> segment array size is well within the range that favors linear search.

Ah, OK.

>> For example, we could keep a bitmap with one bit per K
>> pages.  If the bit is set, there is at least 1 dead tuple on that
>> page; if clear, there are none.  When we see an index tuple, we
>> consult the bitmap to determine whether we need to search the TID
>> list.  We select K to be the smallest power of 2 such that the bitmap
>> uses less memory than some threshold, perhaps 64kB.
>
> I've been pondering something like that, but that's an optimization
> that's quite orthogonal to the multiarray stuff.

Sure, but if this really does increase CPU time, it'd be reasonable to
do something to decrease it again in order to get the other benefits
of this patch - i.e. increasing the maintenance_work_mem limit while
reducing the chances that overallocation will cause OOM.

>>  Assuming that
>> updates and deletes to the table have some locality, we should be able
>> to skip a large percentage of the TID searches with a probe into this
>> very compact bitmap.
>
> I don't think you can assume locality

Really?  If you have a 1TB table, how many 2MB ranges of that table do
you think will contain dead tuples for a typical vacuum?  I think most
tables of that size are going to be mostly static, and the all-visible
and all-frozen bits are going to be mostly set.  You *could* have
something like a pgbench-type workload that does scattered updates
across the entire table, but that's going to perform pretty poorly
because you'll constantly be updating blocks that have to be pulled in
from disk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Sep 13, 2016 at 4:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I've finished writing that patch, I'm in the process of testing its CPU impact.
>>
>> First test seemed to hint at a 40% increase in CPU usage, which seems
>> rather steep compared to what I expected, so I'm trying to rule out
>> some methodology error here.
>
> Hmm, wow.  That's pretty steep.  Maybe lazy_heap_reaptid() is hotter
> than I think it is, but even if it accounts for 10% of total CPU usage
> within a vacuum, which seems like an awful lot, you'd have to make it
> 4x as expensive, which also seems like an awful lot.

IIRC perf top reported a combined 45% between lazy_heap_reaptid +
vac_cmp_itemptr (after patching).

vac_cmp_itemptr was around 15% on its own.

Debug build of course (I need the assertions and the debug symbols);
I'll retest with optimizations once the debug tests make sense.

>>> For example, we could keep a bitmap with one bit per K
>>> pages.  If the bit is set, there is at least 1 dead tuple on that
>>> page; if clear, there are none.  When we see an index tuple, we
>>> consult the bitmap to determine whether we need to search the TID
>>> list.  We select K to be the smallest power of 2 such that the bitmap
>>> uses less memory than some threshold, perhaps 64kB.
>>
>> I've been pondering something like that, but that's an optimization
>> that's quite orthogonal to the multiarray stuff.
>
> Sure, but if this really does increase CPU time, it'd be reasonable to
> do something to decrease it again in order to get the other benefits
> of this patch - i.e. increasing the maintenance_work_mem limit while
> reducing the chances that overallocation will cause OOM.

I was hoping it wouldn't regress performance so much. I'd rather
micro-optimize the multiarray implementation until it doesn't and then
think of orthogonal optimizations.

>>>  Assuming that
>>> updates and deletes to the table have some locality, we should be able
>>> to skip a large percentage of the TID searches with a probe into this
>>> very compact bitmap.
>>
>> I don't think you can assume locality
>
> Really?  If you have a 1TB table, how many 2MB ranges of that table do
> you think will contain dead tuples for a typical vacuum?  I think most
> tables of that size are going to be mostly static, and the all-visible
> and all-frozen bits are going to be mostly set.  You *could* have
> something like a pgbench-type workload that does scattered updates
> across the entire table, but that's going to perform pretty poorly
> because you'll constantly be updating blocks that have to be pulled in
> from disk.

I have a few dozen of those in my biggest database. They do updates
and deletes all over the place and, even if they were few, they're
scattered almost uniformly.

Thing is, I think we really need to not worsen that case, which seems
rather common (almost any OLTP with a big enough user base, or a K-V
type of table, or TOAST tables).



Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Wed, Sep 14, 2016 at 12:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached PoC patch changes the representation of dead tuple locations
> to the hashmap having tuple bitmap.
> The one hashmap entry consists of the block number and the TID bitmap
> of corresponding block, and the block number is the hash key of
> hashmap.
> Current implementation of this patch is not smart yet because each
> hashmap entry allocates the tuple bitmap with fixed
> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
> In case where one block can store only the several tens tuples, the
> most bits are would be waste.
>
> After improved this patch as you suggested, I will measure performance benefit.



Now, if I'm reading it correctly, this patch allocates a 132-byte
structure for every page with at least one dead tuple.  In the worst
case where there's just one dead tuple per page, that's a 20x
regression in memory usage.  Actually, it's more like 40x, because
AllocSetAlloc rounds small allocation sizes up to the next-higher
power of two, which really stings for a 132-byte allocation, and then
adds a 16-byte header to each chunk.  But even 20x is clearly not
good.  There are going to be lots of real-world cases where this uses
way more memory to track the same number of dead tuples, and I'm
guessing that benchmarking is going to reveal that it's not faster,
either.


Sawada-san offered to reimplement the patch based on what I proposed upthread. In the new scheme of things, we will allocate a fixed-size bitmap of length 2D bits per page, where D is the average page density of live + dead tuples. (The rationale behind multiplying D by a factor of 2 is to consider the worst-case scenario where every tuple also has an LP_REDIRECT line pointer.) The value of D in most real-world, large tables should not go much beyond, say, 100, assuming 80-byte-wide tuples and an 8K block size. That translates to about 25 bytes/page. So all TIDs with offsets less than 2D can each be represented by a single bit. We augment this with an overflow area to track tuples which fall outside this limit. I believe this area will be small, say 10% of the total allocation.
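
One way the sizing rule could be written down, as a sketch only (the
inputs mirror the pg_stat counters and relpages mentioned above; the
function itself is hypothetical):

#define MAX_HEAP_TUPLES_PER_PAGE 291    /* 8kB blocks; hard upper bound on offsets */

/*
 * Choose the per-page bitmap width from stats available before vacuum
 * starts: twice the average page density D (leaving room for LP_DEAD and
 * LP_REDIRECT line pointers), clamped to the per-page maximum.
 */
static int
bitmap_bits_per_page(double n_live_tup, double n_dead_tup, double relpages)
{
    double  density = (n_live_tup + n_dead_tup) / relpages;    /* D */
    int     bits = (int) (2.0 * density);

    if (bits > MAX_HEAP_TUPLES_PER_PAGE)
        bits = MAX_HEAP_TUPLES_PER_PAGE;
    if (bits < 1)
        bits = 1;
    return bits;
}

With D around 100, this comes out at the ~25 bytes/page mentioned above.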

This representation is at least as good as the current representation if there are at least 4-5% dead tuples. I don't think very large tables will be vacuumed with a scale factor less than that. And assuming 10% dead tuples, this representation will actually be much better.

The idea can fail when (a) there are very few dead tuples in the table, say less than 5%, or (b) there are a large number of tuples falling outside the 2D limit. While I don't expect either of these to hold for real-world, very large tables, (a) can be anticipated when the vacuum starts, in which case we use the current representation, and (b) can be detected at run time, with a one-time switch between representations. You may argue that managing two representations is clumsy, which I agree with, but the code is completely isolated and probably not more than a few hundred lines.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Wed, Sep 14, 2016 at 8:47 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:


Sawada-san offered to reimplement the patch based on what I proposed upthread. In the new scheme of things, we will allocate a fixed size bitmap of length 2D bits per page where D is average page density of live + dead tuples. (The rational behind multiplying D by a factor of 2 is to consider worst case scenario where every tuple also has a LP_DIRECT line pointer). The value of D in most real world, large tables should not go much beyond, say 100, assuming 80 bytes wide tuple and 8K blocksize. That translates to about 25 bytes/page. So all TIDs with offset less than 2D can be represented by a single bit. We augment this with an overflow to track tuples which fall outside this limit. I believe this area will be small, say 10% of the total allocation.


So I cooked up the attached patch to track the number of live/dead tuples found at each offset from 1 to MaxOffsetNumber. The idea was to see how many tuples actually go beyond the threshold of 2D offsets per page. Note that I am proposing to track the first 2D offsets via the bitmap and the rest via a regular TID array.

So I ran a pgbench test for 2 hours with scale factor 500. The autovacuum scale factor was set to 0.1, i.e. 10%.

Some interesting bits:

postgres=# select relname, n_tup_ins, n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup, pg_relation_size(relid)/8192 as relsize, (n_live_tup+n_dead_tup)/(pg_relation_size(relid)/8192) as density from pg_stat_user_tables ;
     relname      | n_tup_ins | n_tup_upd | n_tup_hot_upd | n_live_tup | n_dead_tup | relsize | density 
------------------+-----------+-----------+---------------+------------+------------+---------+---------
 pgbench_tellers  |      5000 |  95860289 |      87701578 |       5000 |          0 |    3493 |       1
 pgbench_branches |       500 |  95860289 |      94158081 |        967 |          0 |    1544 |       0
 pgbench_accounts |  50000000 |  95860289 |      93062567 |   51911657 |    3617465 |  865635 |      64
 pgbench_history  |  95860289 |         0 |             0 |   95258548 |          0 |  610598 |     156
(4 rows)

The smaller tellers and branches tables bloat so much that the density as computed from live + dead tuples falls close to 1 tuple/page. So for such tables, the idea of 2D bits/page will fail miserably. But I think these tables are worst-case representatives, and I would be extremely surprised if we ever found a very large table bloated that much. Even so, this probably tells us that we can't rely solely on the density measure.

Another interesting bit about these small tables is that the largest used offset never went beyond 291, which is the value of MaxHeapTuplesPerPage. I don't know if there is something that prevents inserting more than MaxHeapTuplesPerPage offsets per heap page, and I don't know at this point if this gives us an upper limit for bits per page (maybe it does).

For the pgbench_accounts table, the maximum offset used was 121, though the bulk of the used offsets were at the start of the page (see attached graph). The test did not create enough dead tuples to trigger autovacuum on the pgbench_accounts table, so I ran a manual vacuum at the end. (There were about 5% dead tuples in the table by the time the test finished.)

postgres=# VACUUM VERBOSE pgbench_accounts ;
INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 2797722 row versions
DETAIL:  CPU 0.00s/9.39u sec elapsed 9.39 sec.
INFO:  "pgbench_accounts": removed 2797722 row versions in 865399 pages
DETAIL:  CPU 0.10s/7.01u sec elapsed 7.11 sec.
INFO:  index "pgbench_accounts_pkey" now contains 50000000 row versions in 137099 pages
DETAIL:  2797722 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pgbench_accounts": found 852487 removable, 50000000 nonremovable row versions in 865635 out of 865635 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 802256 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.73s/27.20u sec elapsed 27.98 sec.

[attached graph: tuple count at each offset (offnum:all_tuples:dead_tuples)]

For 2797722 dead line pointers, the current representation would have used 2797722 x 6 = 16786332 bytes of memory. The optimal bitmap would have used (121 bits/page x 865399 pages) / 8 = 13089159 bytes, whereas if we had provisioned 2D bits/page, assuming D = 64 based on the above calculation, we would have used 13846384 bytes of memory. That is about 18% less than the current representation. Of course, we would have allocated some space for the overflow region, which will make the difference smaller/negligible. But the bitmaps would be extremely cheap to look up during index scans.

Now maybe I got lucky, maybe I did not run the tests long enough (though I believe that may have worked in favour of the bitmap), maybe mostly-HOT-updated tables are not good candidates for testing, and maybe there are situations where the proposed bitmap representation will fail badly. But these tests show that the idea is at least worth considering and that we can improve things for at least some workloads. The question is whether we can avoid regressions in the not-so-good cases.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Another interesting bit about these small tables is that the largest used
> offset for these tables never went beyond 291 which is the value of
> MaxHeapTuplesPerPage. I don't know if there is something that prevents
> inserting more than  MaxHeapTuplesPerPage offsets per heap page and I don't
> know at this point if this gives us upper limit for bits per page (may be it
> does).

From PageAddItemExtended:
    /* Reject placing items beyond heap boundary, if heap */
    if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage)
    {
        elog(WARNING, "can't put more than MaxHeapTuplesPerPage items in a heap page");
        return InvalidOffsetNumber;
    }

Also see the comment where MaxHeapTuplesPerPage is defined:

 * Note: with HOT, there could theoretically be more line pointers (not actual
 * tuples) than this on a heap page.  However we constrain the number of line
 * pointers to this anyway, to avoid excessive line-pointer bloat and not
 * require increases in the size of work arrays.
 

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Wed, Sep 14, 2016 at 5:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Another interesting bit about these small tables is that the largest used
> offset for these tables never went beyond 291 which is the value of
> MaxHeapTuplesPerPage. I don't know if there is something that prevents
> inserting more than  MaxHeapTuplesPerPage offsets per heap page and I don't
> know at this point if this gives us upper limit for bits per page (may be it
> does).

From PageAddItemExtended:

    /* Reject placing items beyond heap boundary, if heap */
    if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage)
    {
        elog(WARNING, "can't put more than MaxHeapTuplesPerPage items
in a heap page");
        return InvalidOffsetNumber;
    }

Also see the comment where MaxHeapTuplesPerPage is defined:

 * Note: with HOT, there could theoretically be more line pointers (not actual
 * tuples) than this on a heap page.  However we constrain the number of line
 * pointers to this anyway, to avoid excessive line-pointer bloat and not
 * require increases in the size of work arrays.


Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per-page bitmap size. That's about 36 bytes for an 8K page. IOW, if on average there are 6 or more dead tuples per page, the bitmap will outperform the current representation, assuming the maximum allocation for the bitmap. If we can use additional estimates to restrict the size to a somewhat more conservative value and then keep an overflow area, the break-even probably happens even earlier than that. I hope this gives us a good starting point, but let me know if you think it's still the wrong approach to pursue.

Thanks,
Pavan 

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per page
> bitmap size. Thats about 36 bytes for 8K page. IOW if on an average there
> are 6 or more dead tuples per page, bitmap will outperform the current
> representation, assuming max allocation for bitmap. If we can use additional
> estimates to restrict the size to somewhat more conservative value and then
> keep overflow area, then probably the break-even happens even earlier than
> that. I hope this gives us a good starting point, but let me know if you
> think it's still a wrong approach to pursue.

Well, it's certainly a bigger change.  I think the big concern is that
the amount of memory now becomes fixed based on the table size.  So
one problem is that you have to figure out what you're going to do if
the bitmap doesn't fit in maintenance_work_mem.  A related problem is
that it might fit but use more memory than before, which could cause
problems for some people.  Now on the other hand it could also use
less memory for some people, and that would be good.

I am kind of doubtful about this whole line of investigation because
we're basically trying pretty hard to fix something that I'm not sure
is broken.    I do agree that, all other things being equal, the TID
lookups will probably be faster with a bitmap than with a binary
search, but maybe not if the table is large and the number of dead
TIDs is small, because cache efficiency is pretty important.  But even
if it's always faster, does TID lookup speed even really matter to
overall VACUUM performance? Claudio's early results suggest that it
might, but maybe that's just a question of some optimization that
hasn't been done yet.

I'm fairly sure that our number one priority should be to minimize the
number of cases where we need to do multiple scans of the indexes to
stay within maintenance_work_mem.  If we're satisfied we've met that
goal, then within that we should try to make VACUUM as fast as
possible with as little memory usage as possible.  I'm not 100% sure I
know how to get there, or how much work it's worth expending.  In
theory we could even start with the list of TIDs and switch to the
bitmap if the TID list becomes larger than the bitmap would have been,
but I don't know if it's worth the effort.

/me thinks a bit.

Actually, I think that probably *is* worthwhile, specifically because
it might let us avoid multiple index scans in cases where we currently
require them.  Right now, our default maintenance_work_mem value is
64MB, which is enough to hold a little over ten million tuples.  It's
also large enough to hold a bitmap for a 14GB table.  So right now if
you deleted, say, 100 tuples per page you would end up with an index
vacuum cycle for every ~100,000 pages = 800MB, whereas switching to
the bitmap representation for such cases would require only one index
vacuum cycle for every 14GB, more than an order of magnitude
improvement!

On the other hand, if we switch to the bitmap as the ONLY possible
representation, we will lose badly when there are scattered updates -
e.g. 1 deleted tuple every 10 pages.  So it seems like we probably
want to have both options.  One tricky part is figuring out how we
switch between them when memory gets tight; we have to avoid bursting
above our memory limit while making the switch.  And even if our
memory limit is very high, we want to avoid using memory gratuitously;
I think we should try to grow memory usage incrementally with either
representation.

For instance, one idea to grow memory usage incrementally would be to
store dead tuple information separately for each 1GB segment of the
relation.  So we have an array of dead-tuple-representation objects,
one for every 1GB of the relation.  If there are no dead tuples in a
given 1GB segment, then this pointer can just be NULL.  Otherwise, it
can point to either the bitmap representation (which will take ~4.5MB)
or it can point to an array of TIDs (which will take 6 bytes/TID).
That could handle an awfully wide variety of usage patterns
efficiently; it's basically never worse than what we're doing today,
and when the dead tuple density is high for any portion of the
relation it's a lot better.
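
Purely as a sketch of that layout (invented names; 131072 8kB pages per
1GB segment, and the ~4.5MB comes from 131072 pages x 291 bits), the
per-segment objects might look like:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;                /* simplified stand-ins */
typedef struct { BlockNumber block; uint16_t offset; } ItemPointerData;

#define SEGMENT_PAGES       131072           /* 1GB of 8kB heap pages */
#define MAX_TUPLES_PER_PAGE 291              /* MaxHeapTuplesPerPage for 8kB blocks */

typedef enum { SEG_EMPTY, SEG_TID_ARRAY, SEG_BITMAP } SegmentKind;

/* One object per 1GB of heap.  A sparse segment keeps a plain sorted TID
 * array; a dense one switches to a bitmap of SEGMENT_PAGES x
 * MAX_TUPLES_PER_PAGE bits (~4.5MB). */
typedef struct DeadTupleSegment
{
    SegmentKind kind;
    union
    {
        struct
        {
            size_t           ntids;
            ItemPointerData *tids;           /* one entry per dead tuple */
        }           array;
        uint8_t    *bitmap;
    }           u;
} DeadTupleSegment;

typedef struct DeadTupleMap
{
    int                nsegments;            /* relation size in 1GB segments */
    DeadTupleSegment **segments;             /* NULL where a segment has no dead tuples */
} DeadTupleMap;

Memory then grows one segment at a time, and each segment can
independently pick whichever representation is cheaper for its dead-tuple
density.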

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.    I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.

FYI, the reported impact was on CPU time, not runtime. There was no
significant difference in runtime (real time), because my test is
heavily I/O bound.

I tested with a few small tables and there was no significant
difference either, but small tables don't stress the array lookup
anyway so that's expected.

But on the assumption that some systems may be CPU bound during vacuum
(particularly those able to do more than 300-400MB/s sequential I/O),
in those cases the increased or decreased cost of lazy_tid_reaped will
directly correlate with runtime. It's just not the case for any of my
systems, which all run on Amazon and are heavily bandwidth constrained
(the fastest I/O subsystem I can get my hands on does 200MB/s).



Re: Vacuum: allow usage of more than 1GB of work mem

From
Arthur Silva
Date:
On Sep 14, 2016 5:18 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
>
> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.

I'd say it's an idea worth pursuing. It's the base idea behind roaring
bitmaps, arguably the best overall compressed bitmap implementation.

Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Wed, Sep 14, 2016 at 8:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:


I am kind of doubtful about this whole line of investigation because
we're basically trying pretty hard to fix something that I'm not sure
is broken.    I do agree that, all other things being equal, the TID
lookups will probably be faster with a bitmap than with a binary
search, but maybe not if the table is large and the number of dead
TIDs is small, because cache efficiency is pretty important.  But even
if it's always faster, does TID lookup speed even really matter to
overall VACUUM performance? Claudio's early results suggest that it
might, but maybe that's just a question of some optimization that
hasn't been done yet.
 
Yeah, I wouldn't worry only about the lookup speedup, but if it does speed things up, that's a bonus. The bitmaps seem to win even on memory consumption: as theory and experiments both show, at a 10% dead tuple ratio bitmaps will win handsomely.
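For reference, a back-of-the-envelope sketch of that break-even point (the 291-tuples-per-8K-page figure and the 6-byte TID size are the assumptions discussed upthread, not numbers taken from any patch):

#include <stdio.h>

/*
 * Illustrative only: per-page bitmap cost vs. 6-byte TIDs, assuming an
 * 8K page (MaxHeapTuplesPerPage ~ 291).
 */
int
main(void)
{
    int     max_tuples_per_page = 291;                      /* assumed, 8K pages */
    int     bitmap_bytes = (max_tuples_per_page + 7) / 8;   /* ~37 bytes */
    int     tid_bytes = 6;                                  /* sizeof(ItemPointerData) */
    int     dead;

    for (dead = 1; dead <= 10; dead++)
        printf("%2d dead/page: TID array %3d bytes, bitmap %3d bytes\n",
               dead, dead * tid_bytes, bitmap_bytes);

    /* Break-even lands at roughly 6 dead tuples per page (6 * 6 = 36 bytes). */
    return 0;
}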

 In
theory we could even start with the list of TIDs and switch to the
bitmap if the TID list becomes larger than the bitmap would have been,
but I don't know if it's worth the effort.


Yes, that works too. Or may be even better because we already know the bitmap size requirements, definitely for the tuples collected so far. We might need to maintain some more stats to further optimise the representation, but that seems like unnecessary detailing at this point.
 

On the other hand, if we switch to the bitmap as the ONLY possible
representation, we will lose badly when there are scattered updates -
e.g. 1 deleted tuple every 10 pages. 

Sure. I never suggested that. What I'd suggested is to switch back to array representation once we realise bitmaps are not going to work. But I see it's probably better the other way round.

 
So it seems like we probably
want to have both options.  One tricky part is figuring out how we
switch between them when memory gets tight; we have to avoid bursting
above our memory limit while making the switch.

Yes, I was thinking about this problem. Some modelling will be necessary to ensure that we don't go (much) beyond the maintenance_work_mem while switching representation, which probably means you need to do that earlier than necessary.
 
For instance, one idea to grow memory usage incrementally would be to
store dead tuple information separately for each 1GB segment of the
relation.  So we have an array of dead-tuple-representation objects,
one for every 1GB of the relation.  If there are no dead tuples in a
given 1GB segment, then this pointer can just be NULL.  Otherwise, it
can point to either the bitmap representation (which will take ~4.5MB)
or it can point to an array of TIDs (which will take 6 bytes/TID).
That could handle an awfully wide variety of usage patterns
efficiently; it's basically never worse than what we're doing today,
and when the dead tuple density is high for any portion of the
relation it's a lot better.


Yes, seems like a good idea. Another idea that I had in mind is to use some sort of indirection map where the bitmap for every block or set of blocks is either recorded or not, depending on whether a bit is set for the range. If the bitmap exists, the indirection map gives the offset into the larger bitmap area. Seems similar to what you described.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.

If you compress the list into a bitmap a posteriori, you know the
number of tuples per page, so you could encode the bitmap even more
efficiently.

It's not a bad idea, one that can be slapped on top of the multiarray
patch - when closing a segment, it can be decided whether to turn it
into a bitmap or not.
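As a rough illustration of that idea (the names here are hypothetical, not from the posted patch), a per-page bitmap built when a segment is closed could carry its own bit count instead of always reserving MaxHeapTuplesPerPage bits:

/*
 * Hypothetical sketch: once a segment of sorted TIDs is closed, the highest
 * dead offset on each page is known, so each page's bitmap can be sized to
 * that instead of to the per-page maximum.
 */
typedef struct PageBitmap
{
    unsigned int    blkno;      /* heap block this bitmap covers */
    unsigned short  nbits;      /* highest dead offset seen on the page */
    unsigned char   bits[];     /* (nbits + 7) / 8 bytes follow */
} PageBitmap;

static inline unsigned int
page_bitmap_bytes(unsigned short max_offset_seen)
{
    return (unsigned int) sizeof(PageBitmap) + (max_offset_seen + 7) / 8;
}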



Re: Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Robert Haas wrote:

> Actually, I think that probably *is* worthwhile, specifically because
> it might let us avoid multiple index scans in cases where we currently
> require them.  Right now, our default maintenance_work_mem value is
> 64MB, which is enough to hold a little over ten million tuples.  It's
> also large enough to hold a bitmap for a 14GB table.  So right now if
> you deleted, say, 100 tuples per page you would end up with an index
> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to
> the bitmap representation for such cases would require only one index
> vacuum cycle for every 14GB, more than an order of magnitude
> improvement!

Yeah, this sounds worthwhile.  If we switch to the more compact
in-memory representation close to the point where we figure the TID
array is not going to fit in m_w_m, then we're saving some number of
additional index scans, and I'm pretty sure that the time to transform
from array to bitmap is going to be more than paid back by the I/O
savings.

One thing not quite clear to me is how do we create the bitmap
representation starting from the array representation in midflight
without using twice as much memory transiently.  Are we going to write
the array to a temp file, free the array memory, then fill the bitmap by
reading the array from disk?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Simon Riggs
Date:
On 14 September 2016 at 11:19, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

>>  In
>> theory we could even start with the list of TIDs and switch to the
>> bitmap if the TID list becomes larger than the bitmap would have been,
>> but I don't know if it's worth the effort.
>>
>
> Yes, that works too. Or may be even better because we already know the
> bitmap size requirements, definitely for the tuples collected so far. We
> might need to maintain some more stats to further optimise the
> representation, but that seems like unnecessary detailing at this point.

That sounds best to me... build the simple representation, but as we
do maintain stats to show to what extent that set of tuples is
compressible.

When we hit the limit on memory we can then selectively compress
chunks to stay within memory, starting with the most compressible
chunks.

I think we should use the chunking approach Robert suggests, though
mainly because that allows us to consider how parallel VACUUM should
work - writing the chunks to shmem. That would also allow us to apply
a single global limit for vacuum memory rather than an allocation per
VACUUM.
We can then scan multiple indexes at once in parallel, all accessing
the shmem data structure.

We should also find the compression is better when we consider chunks
rather than the whole data structure at once.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:


One thing not quite clear to me is how do we create the bitmap
representation starting from the array representation in midflight
without using twice as much memory transiently.  Are we going to write
the array to a temp file, free the array memory, then fill the bitmap by
reading the array from disk?

We could do that. Or maybe compress the TID array once half of m_w_m is consumed, and do this repeatedly with the remaining memory. For example, if we start with 1GB of memory, we decide to compress at 512MB. Say that results in 300MB for the bitmap. We then continue to accumulate TIDs and do another round of folding when another 350MB is consumed.
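To make the arithmetic concrete, a small illustrative sketch of that halve-and-fold loop (the 0.6 compression ratio is an assumption chosen to roughly match the 512MB-to-300MB example above):

#include <stdio.h>

/*
 * Illustrative only: accumulate raw TIDs until half of the not-yet-pinned
 * budget is used, fold them into a (smaller) bitmap, and repeat with what
 * remains.  Sizes are in MB.
 */
int
main(void)
{
    double      budget = 1024.0;    /* maintenance_work_mem */
    double      pinned = 0.0;       /* memory held by already-built bitmaps */
    double      ratio = 0.6;        /* assumed bitmap size vs. raw TID array */
    int         round;

    for (round = 1; round <= 4; round++)
    {
        double      raw = (budget - pinned) / 2.0;  /* TIDs collected this round */

        printf("round %d: collect %.0f MB of TIDs, fold to ~%.0f MB of bitmap\n",
               round, raw, raw * ratio);
        pinned += raw * ratio;
    }
    return 0;
}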

I think we should maintain a per-offset count of the number of dead tuples to choose the most optimal bitmap size (one that needs an overflow region). We can also track how many blocks or block ranges have at least one dead tuple, to know whether it's worthwhile to have some sort of indirection. Together that can tell us how much compression can be achieved and allow us to choose the most optimal representation.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Tom Lane
Date:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>> One thing not quite clear to me is how do we create the bitmap
>> representation starting from the array representation in midflight
>> without using twice as much memory transiently.  Are we going to write
>> the array to a temp file, free the array memory, then fill the bitmap by
>> reading the array from disk?

> We could do that.

People who are vacuuming because they are out of disk space will be very
very unhappy with that solution.
        regards, tom lane



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Wed, Sep 14, 2016 at 1:23 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Actually, I think that probably *is* worthwhile, specifically because
>> it might let us avoid multiple index scans in cases where we currently
>> require them.  Right now, our default maintenance_work_mem value is
>> 64MB, which is enough to hold a little over ten million tuples.  It's
>> also large enough to hold a bitmap for a 14GB table.  So right now if
>> you deleted, say, 100 tuples per page you would end up with an index
>> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to
>> the bitmap representation for such cases would require only one index
>> vacuum cycle for every 14GB, more than an order of magnitude
>> improvement!
>
> Yeah, this sounds worthwhile.  If we switch to the more compact
> in-memory representation close to the point where we figure the TID
> array is not going to fit in m_w_m, then we're saving some number of
> additional index scans, and I'm pretty sure that the time to transform
> from array to bitmap is going to be more than paid back by the I/O
> savings.

Yes, that seems pretty clear.  The indexes can be arbitrarily large
and there can be arbitrarily many of them, so we could be save a LOT
of I/O.

> One thing not quite clear to me is how do we create the bitmap
> representation starting from the array representation in midflight
> without using twice as much memory transiently.  Are we going to write
> the array to a temp file, free the array memory, then fill the bitmap by
> reading the array from disk?

I was just thinking about this exact problem while I was out to
lunch.[1]  I wonder if we could do something like this:

1. Allocate an array large enough for one pointer per gigabyte of the
underlying relation.

2. Allocate 64MB, or the remaining amount of maintenance_work_mem if
it's less, to store TIDs.

3. At the beginning of each 1GB chunk, add a pointer to the first free
byte in the slab allocated in step 2 to the array allocated in step 1.
Write a header word that identifies this as a TID list (rather than a
bitmap) and leave space for a TID count; then, write the TIDs
afterwards.  Continue doing this until one of the following things
happens: (a) we reach the end of the 1GB chunk - if that happens,
restart step 3 for the next chunk; (b) we fill the chunk - see step 4,
or (c) we write so many TIDs for the chunk that the space being used for
TIDs now exceeds the space needed for a bitmap - see step 5.

4. When we fill up one of the slabs allocated in step 2, allocate a
new one and move the tuples for the current 1GB chunk to the beginning
of the new slab using memmove().  This is wasteful of both CPU time
and memory, but I think it's not that bad.  The maximum possible waste
is less than 10%, and many allocators have more overhead than that.
We could reduce the waste by using, say, 256MB chunks rather than 1GB
chunks.   If no new slab can be allocated because maintenance_work_mem
is completely exhausted (or the remaining space isn't enough for the
TIDs that would need to be moved immediately), then stop and do an
index vacuum cycle.

5. When we write a large enough number of TIDs for a 1GB chunk that the
bitmap would be smaller, check whether sufficient maintenance_work_mem
remains to allocate a bitmap for that chunk (~4.5MB).  If not, never
mind; continue with step 3 as if the bitmap representation did not
exist.  If so, allocate space for a bitmap, move all of the TIDs for
the current chunk into it, and update the array allocated in step 1 to
point to it.  Then, finish scanning the current 1GB chunk, updating
that bitmap rather than inserting TIDs into the slab.  Rewind our
pointer into the slab to where it was at the beginning of the current
1GB chunk, so that the memory we consumed for TIDs can be reused now
that those TIDs have been transferred to a bitmap.  If, earlier in the
current 1GB chunk, we did a memmove-to-next-slab operation as
described in step 4, this "rewind" might move our pointer back into
the previous slab, in which case we can free the now-empty slab.  (The
next 1GB segment might have few enough TIDs that they will fit into
the leftover space in the previous slab.)

With this algorithm, we never exceed maintenance_work_mem, not even
transiently.  When memory is no longer sufficient to convert to the
bitmap representation without bursting above maintenance_work_mem, we
simply don't perform the conversion.  Also, we do very little memory
copying.  An alternative I considered was to do a separate allocation
for each 1GB chunk rather than carving the dead-tuple space out of
slabs.  But the problem with that is that you'll have to start those
out small (in case you don't find many dead tuples) and then grow
them, which means reallocating, which is bad both because it can burst
above maintenance_work_mem while the repalloc is in process and also
because you have to keep copying the data from the old chunk to the
new, bigger chunk.  This algorithm only needs to copy TIDs when it
runs off the end of a chunk, and that can't happen more than once
every dozen or so chunks; in contrast, progressively growing the TID
arrays for a given 1GB chunk would potentially memcpy() multiple times
per 1GB chunk, and if you used power-of-two reallocation as we
normally do the waste would be much more than what step 4 of this
algorithm leaves on the table.

There are, nevertheless, corner cases where this can lose: if you had
a number of TIDs that were going to just BARELY fit within
maintenance_work_mem, and if the bitmap representation never wins for
you, the additional array allocated in step 1 and the end-of-slab
wastage in step 4 could push you over the line.  We might be able to
tweak things here or there to reduce the potential for that, but it's
pretty well unavoidable; the flat array we're using right now has
exactly zero allocator overhead, and anything more complex will have
some.
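A minimal data-structure sketch of steps 1-5, purely to make the layout concrete; the names and field choices here are hypothetical, and the conversion logic of steps 4 and 5 is omitted:

#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical sketch of the proposal above, not code from any posted patch.
 * One entry per 1GB chunk of the heap, pointing either at a TID list carved
 * out of a shared slab or at a per-chunk bitmap.
 */
enum { CHUNK_EMPTY = 0, CHUNK_TID_LIST, CHUNK_BITMAP };

typedef struct ChunkDeadTuples
{
    uint8_t     kind;           /* CHUNK_EMPTY, CHUNK_TID_LIST or CHUNK_BITMAP */
    uint32_t    ntids;          /* dead TIDs recorded, when kind is a TID list */
    union
    {
        uint8_t    *tids;       /* points into a slab; 6 bytes per TID */
        uint8_t    *bitmap;     /* ~4.5MB, one bit per possible line pointer */
    }           u;
} ChunkDeadTuples;

typedef struct DeadTupleSpace
{
    ChunkDeadTuples *chunks;    /* step 1: one entry per 1GB of the relation */
    uint32_t    nchunks;
    uint8_t    *slab;           /* step 2: current slab for TID lists */
    size_t      slab_used;
    size_t      slab_size;
} DeadTupleSpace;

static DeadTupleSpace *
dead_tuple_space_create(uint64_t relation_bytes, size_t slab_bytes)
{
    DeadTupleSpace *space = calloc(1, sizeof(DeadTupleSpace));

    space->nchunks = (uint32_t) ((relation_bytes + (UINT64_C(1) << 30) - 1) >> 30);
    space->chunks = calloc(space->nchunks, sizeof(ChunkDeadTuples));   /* all CHUNK_EMPTY */
    space->slab_size = slab_bytes;      /* e.g. 64MB, or whatever remains of m_w_m */
    space->slab = malloc(space->slab_size);
    space->slab_used = 0;
    return space;
}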

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Or am I always out to lunch?



Re: Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Thu, Sep 15, 2016 at 2:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 14 September 2016 at 11:19, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>>>  In
>>> theory we could even start with the list of TIDs and switch to the
>>> bitmap if the TID list becomes larger than the bitmap would have been,
>>> but I don't know if it's worth the effort.
>>>
>>
>> Yes, that works too. Or may be even better because we already know the
>> bitmap size requirements, definitely for the tuples collected so far. We
>> might need to maintain some more stats to further optimise the
>> representation, but that seems like unnecessary detailing at this point.
>
> That sounds best to me... build the simple representation, but as we
> do maintain stats to show to what extent that set of tuples is
> compressible.
>
> When we hit the limit on memory we can then selectively compress
> chunks to stay within memory, starting with the most compressible
> chunks.
>
> I think we should use the chunking approach Robert suggests, though
> mainly because that allows us to consider how parallel VACUUM should
> work - writing the chunks to shmem. That would also allow us to apply
> a single global limit for vacuum memory rather than an allocation per
> VACUUM.
> We can then scan multiple indexes at once in parallel, all accessing
> the shmem data structure.
>

Yeah, the chunking approach Robert suggested seems like a good idea,
but considering implementing parallel vacuum, it would be more
complicated IMO.
I think it's better for each process to simply allocate memory space
for itself than for a single process to allocate a huge memory
space in a complicated way.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Vacuum: allow usage of more than 1GB of work mem

From
Tomas Vondra
Date:

On 09/14/2016 07:57 PM, Tom Lane wrote:
> Pavan Deolasee <pavan.deolasee@gmail.com> writes:
>> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
>> wrote:
>>> One thing not quite clear to me is how do we create the bitmap
>>> representation starting from the array representation in midflight
>>> without using twice as much memory transiently.  Are we going to write
>>> the array to a temp file, free the array memory, then fill the bitmap by
>>> reading the array from disk?
>
>> We could do that.
>
> People who are vacuuming because they are out of disk space will be very
> very unhappy with that solution.

The people are usually running out of space for data, while these files 
would be temporary files placed wherever temp_tablespaces points to. I'd 
argue if this is a source of problems, the people are already in deep 
trouble due to sorts, CREATE INDEX, ... as those commands may also 
generate a lot of temporary files.

regards
Tomas



Re: Vacuum: allow usage of more than 1GB of work mem

From
Tomas Vondra
Date:

On 09/14/2016 05:17 PM, Robert Haas wrote:
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.    I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.

Regarding the lookup performance, I don't think the bitmap alone can 
significantly improve that - it's more efficient memory-wise, no doubt 
about that, but it's still likely larger than CPU caches and accessed 
mostly randomly (when vacuuming the indexes).

IMHO the best way to speed-up lookups (if it's really an issue, haven't 
done any benchmarks) would be to build a small bloom filter in front of 
the TID array / bitmap. It shall be fairly small (depending on the 
number of TIDs, error rate etc.) and likely to fit into L2/L3, and 
eliminate a lot of probes into the much larger array/bitmap.

Of course, it's another layer of complexity - the good thing is we don't 
need to build the filter until after we collect the TIDs, so we got 
pretty good inputs for the bloom filter parameters.
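If the lookups did turn out to be expensive, the pre-filter could be as small as this; a generic Bloom-filter sketch with the standard sizing formulas, not anything from the posted patches:

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative Bloom filter in front of the dead-TID lookup.  It is built
 * after all TIDs are collected, so ndead is exact.
 */
typedef struct TidBloom
{
    uint64_t    nbits;
    int         nhashes;
    uint8_t    *bits;
} TidBloom;

static uint64_t
tid_hash(uint32_t blkno, uint16_t offnum, uint64_t seed)
{
    uint64_t    x = ((uint64_t) blkno << 16) | offnum;

    /* splitmix64-style mixer, seeded differently per hash function */
    x += seed + UINT64_C(0x9E3779B97F4A7C15);
    x = (x ^ (x >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94D049BB133111EB);
    return x ^ (x >> 31);
}

static TidBloom *
tid_bloom_create(uint64_t ndead, double error_rate)
{
    TidBloom   *f = malloc(sizeof(TidBloom));

    /* standard sizing: m = -n ln p / (ln 2)^2, k = (m / n) ln 2 */
    f->nbits = (uint64_t) ceil(ndead * -log(error_rate) / (log(2) * log(2)));
    f->nhashes = (int) ceil(log(2) * (double) f->nbits / (double) ndead);
    f->bits = calloc((f->nbits + 7) / 8, 1);
    return f;
}

static void
tid_bloom_add(TidBloom *f, uint32_t blkno, uint16_t offnum)
{
    for (int i = 0; i < f->nhashes; i++)
    {
        uint64_t    bit = tid_hash(blkno, offnum, (uint64_t) i) % f->nbits;

        f->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
    }
}

static int
tid_bloom_might_contain(const TidBloom *f, uint32_t blkno, uint16_t offnum)
{
    for (int i = 0; i < f->nhashes; i++)
    {
        uint64_t    bit = tid_hash(blkno, offnum, (uint64_t) i) % f->nbits;

        if (!(f->bits[bit / 8] & (1 << (bit % 8))))
            return 0;           /* definitely not dead */
    }
    return 1;                   /* maybe dead: do the real array/bitmap lookup */
}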

But all this is based on the assumption that the lookups are actually 
expensive, not sure about that.

regards
Tomas



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Sep 15, 2016 at 12:50 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/14/2016 07:57 PM, Tom Lane wrote:
>>
>> Pavan Deolasee <pavan.deolasee@gmail.com> writes:
>>>
>>> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera
>>> <alvherre@2ndquadrant.com>
>>> wrote:
>>>>
>>>> One thing not quite clear to me is how do we create the bitmap
>>>> representation starting from the array representation in midflight
>>>> without using twice as much memory transiently.  Are we going to write
>>>> the array to a temp file, free the array memory, then fill the bitmap by
>>>> reading the array from disk?
>>
>>
>>> We could do that.
>>
>>
>> People who are vacuuming because they are out of disk space will be very
>> very unhappy with that solution.
>
>
> The people are usually running out of space for data, while these files
> would be temporary files placed wherever temp_tablespaces points to. I'd
> argue if this is a source of problems, the people are already in deep
> trouble due to sorts, CREATE INDEX, ... as those commands may also generate
> a lot of temporary files.

One would not expect "CREATE INDEX" to succeed when space is tight,
but VACUUM is quite the opposite.

Still, temporary storage could be used if available, and gracefully
fall back to some other technique (like not using bitmaps) when not.

Not sure it's worth the trouble, though.


On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> I am kind of doubtful about this whole line of investigation because
>> we're basically trying pretty hard to fix something that I'm not sure
>> is broken.    I do agree that, all other things being equal, the TID
>> lookups will probably be faster with a bitmap than with a binary
>> search, but maybe not if the table is large and the number of dead
>> TIDs is small, because cache efficiency is pretty important.  But even
>> if it's always faster, does TID lookup speed even really matter to
>> overall VACUUM performance? Claudio's early results suggest that it
>> might, but maybe that's just a question of some optimization that
>> hasn't been done yet.
>
> FYI, the reported impact was on CPU time, not runtime. There was no
> significant difference in runtime (real time), because my test is
> heavily I/O bound.
>
> I tested with a few small tables and there was no significant
> difference either, but small tables don't stress the array lookup
> anyway so that's expected.
>
> But on the assumption that some systems may be CPU bound during vacuum
> (particularly those able to do more than 300-400MB/s sequential I/O),
> in those cases the increased or decreased cost of lazy_tid_reaped will
> directly correlate to runtime. It's just none of my systems, which all
> run on amazon and is heavily bandwidth constrained (fastest I/O
> subsystem I can get my hands on does 200MB/s).

Attached is the patch with the multiarray version.

The tests are weird. Best case comparison over several runs, to remove
the impact of concurrent activity on this host (I couldn't remove all
background activity even when running the tests overnight, the distro
adds tons of crons and background cleanup tasks it would seem),
there's only very mild CPU impact. I'd say insignificant, as it's well
below the mean variance.

Worst case:

DETAIL:  CPU 9.90s/80.94u sec elapsed 1232.42 sec.

Best case:

DETAIL:  CPU 12.10s/63.82u sec elapsed 832.79 sec.

There seems to be more variance with the multiarray approach than the
single array one, but I could not figure out why. Even I/O seems less
stable:

Worst case:

INFO:  "pgbench_accounts": removed 400000000 row versions in 6557382 pages
DETAIL:  CPU 64.31s/37.60u sec elapsed 2573.88 sec.

Best case:

INFO:  "pgbench_accounts": removed 400000000 row versions in 6557378 pages
DETAIL:  CPU 54.48s/31.78u sec elapsed 1552.18 sec.

Since this test takes several hours to complete, I could only run a
few runs of each version, so the statistical significance of the test
isn't very bright.

I'll try to compare with smaller pgbench scale numbers and more runs
over the weekend (gotta script that). It's certainly puzzling, I
cannot explain the increased variance, especially in I/O, since the
I/O should be exactly the same. I'm betting it's my system that's
unpredictable somehow. So I'm posting the patch in case someone gets
inspired and can spot the reason, and because there's been a lot of
talk about this very same approach, so I thought I'd better post the
code ;)

I'll also try to get a more predictable system to run the tests on.

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 09/14/2016 07:57 PM, Tom Lane wrote:
>> People who are vacuuming because they are out of disk space will be very
>> very unhappy with that solution.

> The people are usually running out of space for data, while these files 
> would be temporary files placed wherever temp_tablespaces points to. I'd 
> argue if this is a source of problems, the people are already in deep 
> trouble due to sorts, CREATE INDEX, ... as those commands may also 
> generate a lot of temporary files.

Except that if you are trying to recover disk space, VACUUM is what you
are doing, not CREATE INDEX.  Requiring extra disk space to perform a
vacuum successfully is exactly the wrong direction to be going in.
See for example this current commitfest entry:
https://commitfest.postgresql.org/10/649/
Regardless of what you think of the merits of that patch, it's trying
to solve a real-world problem.  And as Robert has already pointed out,
making this aspect of VACUUM more complicated is not solving any
pressing problem.  "But we made it faster" is going to be a poor answer
for the next person who finds themselves up against the wall with no
recourse.
        regards, tom lane



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Thu, Sep 15, 2016 at 12:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On 09/14/2016 07:57 PM, Tom Lane wrote:
>>> People who are vacuuming because they are out of disk space will be very
>>> very unhappy with that solution.
>
>> The people are usually running out of space for data, while these files
>> would be temporary files placed wherever temp_tablespaces points to. I'd
>> argue if this is a source of problems, the people are already in deep
>> trouble due to sorts, CREATE INDEX, ... as those commands may also
>> generate a lot of temporary files.
>
> Except that if you are trying to recover disk space, VACUUM is what you
> are doing, not CREATE INDEX.  Requiring extra disk space to perform a
> vacuum successfully is exactly the wrong direction to be going in.
> See for example this current commitfest entry:
> https://commitfest.postgresql.org/10/649/
> Regardless of what you think of the merits of that patch, it's trying
> to solve a real-world problem.  And as Robert has already pointed out,
> making this aspect of VACUUM more complicated is not solving any
> pressing problem.  "But we made it faster" is going to be a poor answer
> for the next person who finds themselves up against the wall with no
> recourse.

I very much agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Tomas Vondra
Date:

On 09/15/2016 06:40 PM, Robert Haas wrote:
> On Thu, Sep 15, 2016 at 12:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> On 09/14/2016 07:57 PM, Tom Lane wrote:
>>>> People who are vacuuming because they are out of disk space will be very
>>>> very unhappy with that solution.
>>
>>> The people are usually running out of space for data, while these files
>>> would be temporary files placed wherever temp_tablespaces points to. I'd
>>> argue if this is a source of problems, the people are already in deep
>>> trouble due to sorts, CREATE INDEX, ... as those commands may also
>>> generate a lot of temporary files.
>>
>> Except that if you are trying to recover disk space, VACUUM is what you
>> are doing, not CREATE INDEX.  Requiring extra disk space to perform a
>> vacuum successfully is exactly the wrong direction to be going in.
>> See for example this current commitfest entry:
>> https://commitfest.postgresql.org/10/649/
>> Regardless of what you think of the merits of that patch, it's trying
>> to solve a real-world problem.  And as Robert has already pointed out,
>> making this aspect of VACUUM more complicated is not solving any
>> pressing problem.  "But we made it faster" is going to be a poor answer
>> for the next person who finds themselves up against the wall with no
>> recourse.
>
> I very much agree.
>

How does VACUUM alone help with recovering disk space? AFAIK it only 
makes the space available for new data, it does not reclaim the disk 
space at all. Sure, we truncate empty pages at the end of the last 
segment, but how likely is that in practice? What I do see people doing 
is usually either VACUUM FULL (which is however doomed for obvious 
reasons) or VACUUM + reindexing to get rid of index bloat (which however 
leads to CREATE INDEX using temporary files).

I'm not sure I agree with your claim there's no pressing problem. We do 
see quite a few people having to do VACUUM with multiple index scans 
(because the TIDs don't fit into m_w_m), which certainly has significant 
impact on production systems - both in terms of performance and it also 
slows down reclaiming the space. Sure, being able to set m_w_m above 1GB 
is an improvement, but perhaps using a more efficient TID storage would 
improve the situation further. Writing the TIDs to a temporary file may
not be the right approach, but I don't see why that would make the original
problem less severe?

For example, we always allocate the TID array as large as we can fit 
into m_w_m, but maybe we don't need to wait with switching to the bitmap 
until filling the whole array - we could wait as long as the bitmap fits 
into the remaining part of the array, build it there and then copy it to 
the beginning (and use the bitmap from that point).

regards
Tomas



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Sep 15, 2016 at 3:48 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> For example, we always allocate the TID array as large as we can fit into
> m_w_m, but maybe we don't need to wait with switching to the bitmap until
> filling the whole array - we could wait as long as the bitmap fits into the
> remaining part of the array, build it there and then copy it to the
> beginning (and use the bitmap from that point).

The bitmap can be created like that, but grow from the end of the
segment backwards.

So the scan can proceed until the bitmap fills the whole segment
(filling backwards), no copy required.

I'll try that later, but first I'd like to get multiarray approach
right since that's the basis of it.
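A sketch of that no-copy trick, assuming the segment is a single allocation with the TID array growing forward and the bitmap growing backward (all names hypothetical, not from the posted patch):

#include <stddef.h>

/*
 * Illustrative only: TIDs grow forward from the start of the buffer while
 * the bitmap being built grows backward from the end.  No copy is needed as
 * long as the two regions do not collide.
 */
typedef struct SegmentBuffer
{
    char       *buf;            /* one maintenance_work_mem-sized allocation */
    size_t      size;
    size_t      tid_end;        /* bytes used by the TID array at the front */
    size_t      bitmap_start;   /* offset where the backward-growing bitmap begins */
} SegmentBuffer;

/* True while appending tid_bytes of TIDs and bitmap_bytes of bitmap still fits. */
static int
segment_has_room(const SegmentBuffer *seg, size_t tid_bytes, size_t bitmap_bytes)
{
    return seg->tid_end + tid_bytes + bitmap_bytes <= seg->bitmap_start;
}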



Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Fri, Sep 16, 2016 at 12:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Sep 15, 2016 at 3:48 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> For example, we always allocate the TID array as large as we can fit into
> m_w_m, but maybe we don't need to wait with switching to the bitmap until
> filling the whole array - we could wait as long as the bitmap fits into the
> remaining part of the array, build it there and then copy it to the
> beginning (and use the bitmap from that point).

The bitmap can be created like that, but grow from the end of the
segment backwards.

So the scan can proceed until the bitmap fills the whole segment
(filling backwards), no copy required.

I thought about those approaches when I suggested starting with half m_w_m. So you fill in TIDs from one end and upon reaching half point, convert that into bitmap (assuming stats tell you that there is significant savings with bitmaps) but fill it from the other end. Then reset the TID array and start filling up again. That guarantees that you can always work within available limit.

But I actually wonder if we are over engineering things and overestimating cost of memmove etc. How about this simpler approach:

1. Divide table in some fixed size chunks like Robert suggested. Say 1GB
2. Allocate pointer array to store a pointer to each segment. For 1TB table, thats about 8192 bytes.
3. Allocate a bitmap which can hold MaxHeapTuplesPerPage * chunk size in pages. For 8192-byte blocks and a 1GB chunk, that's about 4.6MB. Note: I'm suggesting to use a bitmap here because, provisioning for the worst case, a fixed-size TID array will cost us 200MB+ whereas a bitmap is just 4.6MB.
4. Collect dead tuples in that 1GB chunk. Also collect stats so that we know about the most optimal representation.
5. At the end of 1GB scan, if no dead tuples found, set the chunk pointer to NULL, move to next chunk and restart step 4. If dead tuples found, then check:
   a. If bitmap can be further compressed by using less number of bits per page. If so, allocate a new bitmap and compress the bitmap.
   b. If TID array will be a more compact representation. If so, allocate a TID array of right size and convert bitmap into an array.
   c. Set chunk pointer to whichever representation we choose (of course add headers etc to interpret the representation)
6. Continue until we consume all m_w_m or end-of-table is reached. If we consume all m_w_m then do a round of index cleanup and restart.

I also realised that we can compact the TID array in step (b) above because we only need to store 17 bits for block numbers (we already know which 1GB segment they belong to). Given that usable offsets are also just 13 bits, TID array needs only 4 bytes per TID instead of 6. 
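A hedged sketch of that 4-byte encoding, with the field split following the message above rather than any posted patch (17 bits for the block within the segment leaves 15 bits for the offset, comfortably more than the 13 needed):

#include <stdint.h>

/*
 * Illustrative only: with 1GB (131072-block) segments the block number
 * within a segment needs 17 bits, so a dead TID packs into 4 bytes instead
 * of 6.  Packing the block first keeps the encoded values sorted in heap
 * order.
 */
#define SEGMENT_BLOCKS  131072U         /* 1GB / 8KB pages */
#define BLOCK_BITS      17
#define OFFSET_BITS     (32 - BLOCK_BITS)
#define OFFSET_MASK     ((1U << OFFSET_BITS) - 1)

static inline uint32_t
encode_tid(uint32_t block_in_segment, uint16_t offnum)
{
    return (block_in_segment << OFFSET_BITS) | (offnum & OFFSET_MASK);
}

static inline void
decode_tid(uint32_t packed, uint32_t *block_in_segment, uint16_t *offnum)
{
    *block_in_segment = packed >> OFFSET_BITS;
    *offnum = (uint16_t) (packed & OFFSET_MASK);
}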

Many people are working on implementing different ideas, and I can volunteer to write a patch on these lines unless someone wants to do that.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Fri, Sep 16, 2016 at 9:09 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

I also realised that we can compact the TID array in step (b) above because we only need to store 17 bits for block numbers (we already know which 1GB segment they belong to). Given that usable offsets are also just 13 bits, TID array needs only 4 bytes per TID instead of 6. 


Actually this seems like a clear savings of at least 30% for all use cases, at the cost of allocating in smaller chunks and doing some transformations. But given the problem we are trying to solve, seems like a small price to pay.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> But I actually wonder if we are over engineering things and overestimating
> cost of memmove etc. How about this simpler approach:

Don't forget that you need to handle the case where
maintenance_work_mem is quite small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Pavan Deolasee
Date:


On Fri, Sep 16, 2016 at 7:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> But I actually wonder if we are over engineering things and overestimating
> cost of memmove etc. How about this simpler approach:

Don't forget that you need to handle the case where
maintenance_work_mem is quite small.


How small? The default IIRC these days is 64MB and minimum is 1MB. I think we can do some special casing for very small values and ensure that things at the very least work and hopefully don't regress for them. 

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Fri, Sep 16, 2016 at 9:47 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> On Fri, Sep 16, 2016 at 7:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee
>> <pavan.deolasee@gmail.com> wrote:
>> > But I actually wonder if we are over engineering things and
>> > overestimating
>> > cost of memmove etc. How about this simpler approach:
>>
>> Don't forget that you need to handle the case where
>> maintenance_work_mem is quite small.
>
> How small? The default IIRC these days is 64MB and minimum is 1MB. I think
> we can do some special casing for very small values and ensure that things
> at the very least work and hopefully don't regress for them.

Sounds like you need to handle values as small as 1MB, then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Sep 15, 2016 at 1:16 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> I am kind of doubtful about this whole line of investigation because
>>> we're basically trying pretty hard to fix something that I'm not sure
>>> is broken.    I do agree that, all other things being equal, the TID
>>> lookups will probably be faster with a bitmap than with a binary
>>> search, but maybe not if the table is large and the number of dead
>>> TIDs is small, because cache efficiency is pretty important.  But even
>>> if it's always faster, does TID lookup speed even really matter to
>>> overall VACUUM performance? Claudio's early results suggest that it
>>> might, but maybe that's just a question of some optimization that
>>> hasn't been done yet.
>>
>> FYI, the reported impact was on CPU time, not runtime. There was no
>> significant difference in runtime (real time), because my test is
>> heavily I/O bound.
>>
>> I tested with a few small tables and there was no significant
>> difference either, but small tables don't stress the array lookup
>> anyway so that's expected.
>>
>> But on the assumption that some systems may be CPU bound during vacuum
>> (particularly those able to do more than 300-400MB/s sequential I/O),
>> in those cases the increased or decreased cost of lazy_tid_reaped will
>> directly correlate to runtime. It's just none of my systems, which all
>> run on amazon and is heavily bandwidth constrained (fastest I/O
>> subsystem I can get my hands on does 200MB/s).
>
> Attached is the patch with the multiarray version.
>
> The tests are weird. Best case comparison over several runs, to remove
> the impact of concurrent activity on this host (I couldn't remove all
> background activity even when running the tests overnight, the distro
> adds tons of crons and background cleanup tasks it would seem),
> there's only very mild CPU impact. I'd say insignificant, as it's well
> below the mean variance.

I reran the tests on a really dedicated system, and with a script that
captured a lot more details about the runs.

The system isn't impressive, an i5 with a single consumer HD and 8GB
RAM, but it did the job.

These tests make more sense, so I bet it was the previous tests that
were spoiled by concurrent activity on the host.

Attached is the raw output of the test, the script used to create it,
and just in case the patch set used. I believe it's the same as the
last one I posted, just rebased.

In the results archive, the .vacuum prefix is the patched version with
both patch 1 and 2, .git.ref is just patch 1 (without which the
truncate takes unholily long).

Grepping the results a bit, picking an average run out of all runs on
each scale:

Timings:

Patched:

s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.

Unpatched:

s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.

Total I/O (in MB)

Patched:

s100: R:2.4 - W:5862
s400: R:1337.4 - W:29385.6
s4000: R:318631 - W:370154

Unpatched:

s100: R:1412.4 - W:7644.6
s400: R:3180.6 - W:36281.4
s4000: R:330683 - W:370391


So, in essence, CPU time didn't get adversely affected. If anything,
it got improved by about 20% on the biggest case (scale 4000).

While total runtime didn't change much, I believe this is only due to
the fact that the index is perfectly correlated (clustered?) since
it's a pristine index, so index scans either remove or skip full
pages, never leaving things half-way. A bloated index would probably
show a substantially different behavior, I'll try to get a script that
does it by running pgbench a while before the vacuum or something like
that.

However, the total I/O metric already shows remarkable improvement.
This metric is measuring all the I/O including pgbench init, the
initial vacuum pgbench init does, the delete and the final vacuum. So
it's not just the I/O for the vacuum itself, but the whole run. We can
see the patched version reading a lot less (less scans over the
indexes will do that), and in some cases writing less too (again, less
index scans may be performing less redundant writes when cleaning
up/reclaiming index pages).

I'll post when I get the results for the bloated case, but I believe
this already shows substantial improvement as is.

This approach can later be improved upon by turning tid segments into
bitmaps if they're packed densely enough, but I believe this patch
represents a sensible first step before attempting that.

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Thu, Oct 27, 2016 at 5:25 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 1:16 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>>
>>>> I am kind of doubtful about this whole line of investigation because
>>>> we're basically trying pretty hard to fix something that I'm not sure
>>>> is broken.    I do agree that, all other things being equal, the TID
>>>> lookups will probably be faster with a bitmap than with a binary
>>>> search, but maybe not if the table is large and the number of dead
>>>> TIDs is small, because cache efficiency is pretty important.  But even
>>>> if it's always faster, does TID lookup speed even really matter to
>>>> overall VACUUM performance? Claudio's early results suggest that it
>>>> might, but maybe that's just a question of some optimization that
>>>> hasn't been done yet.
>>>
>>> FYI, the reported impact was on CPU time, not runtime. There was no
>>> significant difference in runtime (real time), because my test is
>>> heavily I/O bound.
>>>
>>> I tested with a few small tables and there was no significant
>>> difference either, but small tables don't stress the array lookup
>>> anyway so that's expected.
>>>
>>> But on the assumption that some systems may be CPU bound during vacuum
>>> (particularly those able to do more than 300-400MB/s sequential I/O),
>>> in those cases the increased or decreased cost of lazy_tid_reaped will
>>> directly correlate to runtime. It's just none of my systems, which all
>>> run on amazon and is heavily bandwidth constrained (fastest I/O
>>> subsystem I can get my hands on does 200MB/s).
>>
>> Attached is the patch with the multiarray version.
>>
>> The tests are weird. Best case comparison over several runs, to remove
>> the impact of concurrent activity on this host (I couldn't remove all
>> background activity even when running the tests overnight, the distro
>> adds tons of crons and background cleanup tasks it would seem),
>> there's only very mild CPU impact. I'd say insignificant, as it's well
>> below the mean variance.
>
> I reran the tests on a really dedicated system, and with a script that
> captured a lot more details about the runs.
>
> The system isn't impressive, an i5 with a single consumer HD and 8GB
> RAM, but it did the job.
>
> These tests make more sense, so I bet it was the previous tests that
> were spoiled by concurrent activity on the host.
>
> Attached is the raw output of the test, the script used to create it,
> and just in case the patch set used. I believe it's the same as the
> last one I posted, just rebased.

I glanced at the patches, but neither patch follows the coding
style of PostgreSQL.
Please refer to [1].

[1] http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.

>
> In the results archive, the .vacuum prefix is the patched version with
> both patch 1 and 2, .git.ref is just patch 1 (without which the
> truncate takes unholily long).

Did you measure the performance benefit of 0001 patch by comparing
HEAD with HEAD+0001 patch?

> Grepping the results a bit, picking an average run out of all runs on
> each scale:
>
> Timings:
>
> Patched:
>
> s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
> s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
> s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.
>
> Unpatched:
>
> s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
> s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
> s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.
>
> Total I/O (in MB)
>
> Patched:
>
> s100: R:2.4 - W:5862
> s400: R:1337.4 - W:29385.6
> s4000: R:318631 - W:370154
>
> Unpatched:
>
> s100: R:1412.4 - W:7644.6
> s400: R:3180.6 - W:36281.4
> s4000: R:330683 - W:370391
>
>
> So, in essence, CPU time didn't get adversely affected. If anything,
> it got improved by about 20% on the biggest case (scale 4000).

And this test case deletes all tuples in the relation and then measures
the duration of vacuum.
It would not have much effect in a practical use case.

> While total runtime didn't change much, I believe this is only due to
> the fact that the index is perfectly correlated (clustered?) since
> it's a pristine index, so index scans either remove or skip full
> pages, never leaving things half-way. A bloated index would probably
> show a substantially different behavior, I'll try to get a script that
> does it by running pgbench a while before the vacuum or something like
> that.
>
> However, the total I/O metric already shows remarkable improvement.
> This metric is measuring all the I/O including pgbench init, the
> initial vacuum pgbench init does, the delete and the final vacuum. So
> it's not just the I/O for the vacuum itself, but the whole run. We can
> see the patched version reading a lot less (less scans over the
> indexes will do that), and in some cases writing less too (again, less
> index scans may be performing less redundant writes when cleaning
> up/reclaiming index pages).

What value of maintenance_work_mem did you use for this test? Since
the DeadTuplesSegment struct still stores an array of ItemPointerData
(6 bytes each) representing dead tuples, I supposed that the frequency
of index vacuums would not change. But according to the test result, an
index vacuum is invoked once and removes 400000000 rows at a time. That
means vacuum stored about 2289 MB of dead tuples during the heap scan.
On the other hand, the result of the test without the 0002 patch shows
that an index vacuum removes 178956737 rows at a time, which means 1GB
of memory was used.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I glanced at the patches but the both patches don't obey the coding
> style of PostgreSQL.
> Please refer to [1].
>
> [1] http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.

I thought I had. I'll go through that list to check what I missed.

>> In the results archive, the .vacuum prefix is the patched version with
>> both patch 1 and 2, .git.ref is just patch 1 (without which the
>> truncate takes unholily long).
>
> Did you measure the performance benefit of 0001 patch by comparing
> HEAD with HEAD+0001 patch?

Not the whole test, but yes. Without the 0001 patch the backward scan
step during truncate goes between 3 and 5 times slower. I don't have
timings because the test never finished without the patch. It would
have finished, but it would have taken about a day.

>> Grepping the results a bit, picking an average run out of all runs on
>> each scale:
>>
>> Timings:
>>
>> Patched:
>>
>> s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
>> s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
>> s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.
>>
>> Unpatched:
>>
>> s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
>> s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
>> s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.
>>
>> Total I/O (in MB)
>>
>> Patched:
>>
>> s100: R:2.4 - W:5862
>> s400: R:1337.4 - W:29385.6
>> s4000: R:318631 - W:370154
>>
>> Unpatched:
>>
>> s100: R:1412.4 - W:7644.6
>> s400: R:3180.6 - W:36281.4
>> s4000: R:330683 - W:370391
>>
>>
>> So, in essence, CPU time didn't get adversely affected. If anything,
>> it got improved by about 20% on the biggest case (scale 4000).
>
> And this test case deletes all tuples in relation and then measure
> duration of vacuum.
> It would not be effect much in practical use case.

Well, this patch set started because I had to do exactly that, and
realized just how inefficient vacuum was in that case.

But it doesn't mean it won't benefit more realistic use cases. Almost
any big database ends up hitting this 1GB limit because big tables
take very long to vacuum and accumulate a lot of bloat in-between
vacuums.

If you have a specific test case in mind I can try to run it.

>> While total runtime didn't change much, I believe this is only due to
>> the fact that the index is perfectly correlated (clustered?) since
>> it's a pristine index, so index scans either remove or skip full
>> pages, never leaving things half-way. A bloated index would probably
>> show a substantially different behavior, I'll try to get a script that
>> does it by running pgbench a while before the vacuum or something like
>> that.
>>
>> However, the total I/O metric already shows remarkable improvement.
>> This metric is measuring all the I/O including pgbench init, the
>> initial vacuum pgbench init does, the delete and the final vacuum. So
>> it's not just the I/O for the vacuum itself, but the whole run. We can
>> see the patched version reading a lot less (less scans over the
>> indexes will do that), and in some cases writing less too (again, less
>> index scans may be performing less redundant writes when cleaning
>> up/reclaiming index pages).
>
> What value of maintenance_work_mem did you use for this test?

4GB on both, patched and HEAD.

> Since
> DeadTuplesSegment struct still stores array of ItemPointerData(6byte)
> representing dead tuple I supposed that the frequency of index vacuum
> does not change. But according to the test result, a index vacuum is
> invoked once and removes 400000000 rows at the time. It means that the
> vacuum stored about 2289 MB memory during heap vacuum. On the other
> side, the result of test without 0002 patch show that a index vacuum
> remove 178956737 rows at the time, which means 1GB memory was used.

1GB is a hardcoded limit on HEAD for vacuum work mem. This patch
removes that limit and lets vacuum use all the memory the user gave to
vacuum.

In the test, in both cases, 4GB was used as maintenance_work_mem
value, but HEAD cannot use all the 4GB, and it will limit itself to
just 1GB.
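For context, an illustrative bit of arithmetic showing why the unpatched code caps out near that figure regardless of the setting; the constants are assumed to mirror HEAD's ~1GB single-allocation limit and the 6-byte TID size, and may differ slightly from the exact numbers in lazy_space_alloc:

#include <stdio.h>

int
main(void)
{
    long long   max_alloc_bytes = 0x3fffffff;   /* ~1GB single palloc limit, assumed */
    long long   tid_bytes = 6;                  /* sizeof(ItemPointerData) */
    long long   m_w_m_bytes = 4LL * 1024 * 1024 * 1024;    /* 4GB setting */

    long long   requested = m_w_m_bytes / tid_bytes;        /* ~715 million */
    long long   capped = max_alloc_bytes / tid_bytes;       /* ~179 million */

    printf("TID slots requested: %lld\n", requested);
    printf("TID slots allocated: %lld\n",
           requested < capped ? requested : capped);
    return 0;
}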



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Nov 17, 2016 at 2:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I glanced at the patches but the both patches don't obey the coding
>> style of PostgreSQL.
>> Please refer to [1].
>>
>> [1] http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.
>
> I thought I had. I'll go through that list to check what I missed.

Attached is patch 0002 with pgindent applied over it

I don't think there's any other formatting issue, but feel free to
point a finger to it if I missed any

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Attached is patch 0002 with pgindent applied over it
>
> I don't think there's any other formatting issue, but feel free to
> point a finger to it if I missed any

Hmm, I had imagined making all of the segments the same size rather
than having the size grow exponentially.  The whole point of this is
to save memory, and even in the worst case you don't end up with that
many segments as long as you pick a reasonable base size (e.g. 64MB).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Attached is patch 0002 with pgindent applied over it
>>
>> I don't think there's any other formatting issue, but feel free to
>> point a finger to it if I missed any
>
> Hmm, I had imagined making all of the segments the same size rather
> than having the size grow exponentially.  The whole point of this is
> to save memory, and even in the worst case you don't end up with that
> many segments as long as you pick a reasonable base size (e.g. 64MB).

Wastage is bound by a fraction of the total required RAM, that is,
it's proportional to the amount of required RAM, not the amount
allocated. So it should still be fine, and the exponential strategy
should improve lookup performance considerably.



Re: Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> Attached is patch 0002 with pgindent applied over it
>>>
>>> I don't think there's any other formatting issue, but feel free to
>>> point a finger to it if I missed any
>>
>> Hmm, I had imagined making all of the segments the same size rather
>> than having the size grow exponentially.  The whole point of this is
>> to save memory, and even in the worst case you don't end up with that
>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>
> Wastage is bound by a fraction of the total required RAM, that is,
> it's proportional to the amount of required RAM, not the amount
> allocated. So it should still be fine, and the exponential strategy
> should improve lookup performance considerably.

I'm concerned that it could use repalloc for a large memory area when
vacrelstats->dead_tuples.dead_tuple is bloated. That would add overhead
and be slow. What about using a semi-fixed memory space without repalloc:
allocate an array of ItemPointerData arrays, where each ItemPointerData
array stores dead tuple locations. The size of an ItemPointerData array
starts small (e.g. 16MB or 32MB). After we use an array up, we then
allocate the next segment at twice the size of the previous one. But to
prevent over-allocating memory, it would be better to cap the size of an
ItemPointerData array at 512MB or 1GB. We could still expand the array of
arrays using repalloc if needed, but that would not happen often. The rest
of the design is similar to your current patch: the offset into the array
of arrays and the offset into an ItemPointerData array track the current
location, and both are cleared after garbage has been reclaimed from the
heap and indexes. The allocated arrays are re-used by the subsequent heap
scan.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Mon, Nov 21, 2016 at 2:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>> Attached is patch 0002 with pgindent applied over it
>>>>
>>>> I don't think there's any other formatting issue, but feel free to
>>>> point a finger to it if I missed any
>>>
>>> Hmm, I had imagined making all of the segments the same size rather
>>> than having the size grow exponentially.  The whole point of this is
>>> to save memory, and even in the worst case you don't end up with that
>>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>>
>> Wastage is bound by a fraction of the total required RAM, that is,
>> it's proportional to the amount of required RAM, not the amount
>> allocated. So it should still be fine, and the exponential strategy
>> should improve lookup performance considerably.
>
> I'm concerned that it could use repalloc for large memory area when
> vacrelstats->dead_tuples.dead_tuple is bloated. It would be overhead
> and slow.

How large?

That array cannot be very large. It contains pointers to
exponentially-growing arrays, but the repalloc'd array itself is
small: one struct per segment, each segment starts at 128MB and grows
exponentially.

In fact, IIRC, it can be proven that such a repalloc strategy has an
amortized cost of O(log log n) per tuple. If it repalloc'd the whole
tid array it would be O(log n), but since it only handles pointers to
exponentially growing segments of tuples it becomes O(log log n).

Furthermore, n there is still limited to INT_MAX, which means the cost
per tuple is bounded by O(log log 2^32) = 5. And that's an absolute
worst case that ignores the 128MB constant factor, which is indeed
relevant.
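
As a sketch of what that growth looks like (an illustration only: the field
names follow the DeadTuplesSegment / DeadTuplesMultiArray structs from the
patch, but the function below is not the patch's code, and error handling
and the huge-allocation details are left out):

static void
add_segment(DeadTuplesMultiArray *arr)
{
    DeadTuplesSegment *seg;
    int         newsize;

    if (arr->num_segs == 0)
        newsize = (128 << 20) / sizeof(ItemPointerData);    /* ~128MB of TIDs */
    else
        newsize = arr->dead_tuples[arr->num_segs - 1].max_dead_tuples * 2;

    /*
     * Only the small per-segment descriptor array is ever repalloc'd;
     * the TID arrays themselves are allocated once and never moved.
     */
    if (arr->dead_tuples == NULL)
        arr->dead_tuples = palloc(sizeof(DeadTuplesSegment));
    else
        arr->dead_tuples = repalloc(arr->dead_tuples,
                                    (arr->num_segs + 1) * sizeof(DeadTuplesSegment));

    seg = &arr->dead_tuples[arr->num_segs];
    seg->dead_tuples = palloc(newsize * sizeof(ItemPointerData));
    seg->max_dead_tuples = newsize;
    seg->num_dead_tuples = 0;
    arr->num_segs++;
}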

> What about using semi-fixed memroy space without repalloc;
> Allocate the array of ItemPointerData array, and each ItemPointerData
> array stores the dead tuple locations. The size of ItemPointerData
> array starts with small size (e.g. 16MB or 32MB). After we used an
> array up, we then allocate next segment with twice size as previous
> segment.

That's what the patch does.

> But to prevent over allocating memory, it would be better to
> set a limit of allocating size of ItemPointerData array to 512MB or
> 1GB.

There already is a limit of 1GB (the maximum amount palloc can allocate).



Re: Vacuum: allow usage of more than 1GB of work mem

From
Haribabu Kommi
Date:


On Tue, Nov 22, 2016 at 4:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Mon, Nov 21, 2016 at 2:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>> Attached is patch 0002 with pgindent applied over it
>>>>
>>>> I don't think there's any other formatting issue, but feel free to
>>>> point a finger to it if I missed any
>>>
>>> Hmm, I had imagined making all of the segments the same size rather
>>> than having the size grow exponentially.  The whole point of this is
>>> to save memory, and even in the worst case you don't end up with that
>>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>>
>> Wastage is bound by a fraction of the total required RAM, that is,
>> it's proportional to the amount of required RAM, not the amount
>> allocated. So it should still be fine, and the exponential strategy
>> should improve lookup performance considerably.
>
> I'm concerned that it could use repalloc for large memory area when
> vacrelstats->dead_tuples.dead_tuple is bloated. It would be overhead
> and slow.

How large?

That array cannot be very large. It contains pointers to
exponentially-growing arrays, but the repalloc'd array itself is
small: one struct per segment, each segment starts at 128MB and grows
exponentially.

In fact, IIRC, it can be proven that such a repalloc strategy has an
amortized cost of O(log log n) per tuple. If it repallocd the whole
tid array it would be O(log n), but since it handles only pointers to
segments of exponentially growing tuples it becomes O(log log n).

Furthermore, n there is still limited to MAX_INT, which means the cost
per tuple is bound by O(log log 2^32) = 5. And that's an absolute
worst case that's ignoring the 128MB constant factor which is indeed
relevant.

> What about using semi-fixed memroy space without repalloc;
> Allocate the array of ItemPointerData array, and each ItemPointerData
> array stores the dead tuple locations. The size of ItemPointerData
> array starts with small size (e.g. 16MB or 32MB). After we used an
> array up, we then allocate next segment with twice size as previous
> segment.

That's what the patch does.

> But to prevent over allocating memory, it would be better to
> set a limit of allocating size of ItemPointerData array to 512MB or
> 1GB.

There already is a limit to 1GB (the maximum amount palloc can allocate)


Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Anastasia Lubennikova
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

Hi,
I haven't read through the thread yet, just tried to apply the patch and run tests.
And it seems that the last attached version is outdated now. It doesn't apply to master,
and after I tried to fix the merge conflict, it segfaults at initdb.

So, looking forward to a new version for review.

The new status of this patch is: Waiting on Author

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:       not tested
> Spec compliant:           not tested
> Documentation:            not tested
>
> Hi,
> I haven't read through the thread yet, just tried to apply the patch and run tests.
> And it seems that the last attached version is outdated now. It doesn't apply to the master
> and after I've tried to fix merge conflict, it segfaults at initdb.


I'll rebase when I get some time to do it and post an updated version



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
> <lubennikovaav@gmail.com> wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, failed
>> Implements feature:       not tested
>> Spec compliant:           not tested
>> Documentation:            not tested
>>
>> Hi,
>> I haven't read through the thread yet, just tried to apply the patch and run tests.
>> And it seems that the last attached version is outdated now. It doesn't apply to the master
>> and after I've tried to fix merge conflict, it segfaults at initdb.
>
>
> I'll rebase when I get some time to do it and post an updated version

Attached rebased patches. I called them both v3 to be consistent.

I'm not sure how you ran it, but this works fine for me:

./configure --enable-debug --enable-cassert
make clean
make check

... after a while ...

=======================
 All 168 tests passed.
=======================

I reckon the above is equivalent to installcheck, but doesn't require
sudo or actually installing the server, so installcheck should work
assuming the install went ok.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Anastasia Lubennikova
Date:
22.12.2016 21:18, Claudio Freire:
> On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
>> <lubennikovaav@gmail.com> wrote:
>>> The following review has been posted through the commitfest application:
>>> make installcheck-world:  tested, failed
>>> Implements feature:       not tested
>>> Spec compliant:           not tested
>>> Documentation:            not tested
>>>
>>> Hi,
>>> I haven't read through the thread yet, just tried to apply the patch and run tests.
>>> And it seems that the last attached version is outdated now. It doesn't apply to the master
>>> and after I've tried to fix merge conflict, it segfaults at initdb.
>>
>> I'll rebase when I get some time to do it and post an updated version
> Attached rebased patches. I called them both v3 to be consistent.
>
> I'm not sure how you ran it, but this works fine for me:
>
> ./configure --enable-debug --enable-cassert
> make clean
> make check
>
> ... after a while ...
>
> =======================
>   All 168 tests passed.
> =======================
>
> I reckon the above is equivalent to installcheck, but doesn't require
> sudo or actually installing the server, so installcheck should work
> assuming the install went ok.

I found the reason. I configured postgres with CFLAGS="-O0" and it causes 
a segfault on initdb.
It works fine and passes tests with the default configure flags, but I'm 
pretty sure that we should fix the segfault before testing the feature.
If you need it, I'll send a core dump.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>> On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire <klaussfreire@gmail.com>
>> wrote:
>>>
>>> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
>>> <lubennikovaav@gmail.com> wrote:
>>>>
>>>> The following review has been posted through the commitfest application:
>>>> make installcheck-world:  tested, failed
>>>> Implements feature:       not tested
>>>> Spec compliant:           not tested
>>>> Documentation:            not tested
>>>>
>>>> Hi,
>>>> I haven't read through the thread yet, just tried to apply the patch and
>>>> run tests.
>>>> And it seems that the last attached version is outdated now. It doesn't
>>>> apply to the master
>>>> and after I've tried to fix merge conflict, it segfaults at initdb.
>>>
>>>
>>> I'll rebase when I get some time to do it and post an updated version
>>
>> Attached rebased patches. I called them both v3 to be consistent.
>>
>> I'm not sure how you ran it, but this works fine for me:
>>
>> ./configure --enable-debug --enable-cassert
>> make clean
>> make check
>>
>> ... after a while ...
>>
>> =======================
>>   All 168 tests passed.
>> =======================
>>
>> I reckon the above is equivalent to installcheck, but doesn't require
>> sudo or actually installing the server, so installcheck should work
>> assuming the install went ok.
>
>
> I found the reason. I configure postgres with CFLAGS="-O0" and it causes
> Segfault on initdb.
> It works fine and passes tests with default configure flags, but I'm pretty
> sure that we should fix segfault before testing the feature.
> If you need it, I'll send a core dump.

I just ran it with CFLAGS="-O0" and it passes all checks too:

CFLAGS='-O0' ./configure --enable-debug --enable-cassert
make clean && make -j8 && make check-world

A stacktrace and a thorough description of your build environment
would be helpful to understand why it breaks on your system.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Anastasia Lubennikova
Date:
23.12.2016 22:54, Claudio Freire:
On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I found the reason. I configure postgres with CFLAGS="-O0" and it causes
Segfault on initdb.
It works fine and passes tests with default configure flags, but I'm pretty
sure that we should fix segfault before testing the feature.
If you need it, I'll send a core dump.
I just ran it with CFLAGS="-O0" and it passes all checks too:

CFLAGS='-O0' ./configure --enable-debug --enable-cassert
make clean && make -j8 && make check-world

A stacktrace and a thorough description of your build environment
would be helpful to understand why it breaks on your system.

I ran configure using following set of flags:
 ./configure --enable-tap-tests --enable-cassert --enable-debug --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
And then ran make check. Here is the stacktrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
1417                tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
(gdb) bt
#0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
#1  0x0000000000693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9, vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')
    at vacuumlazy.c:1337
#2  0x0000000000691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9, params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
#3  0x000000000069191f in vacuum_rel (relid=1247, relation=0x0, options=9, params=0x7ffe0f866310) at vacuum.c:1418
#4  0x0000000000690122 in vacuum (options=9, relation=0x0, relid=0, params=0x7ffe0f866310, va_cols=0x0, bstrategy=0x1f1c4a8,
    isTopLevel=1 '\001') at vacuum.c:320
#5  0x000000000068fd0b in vacuum (options=-1652367447, relation=0x0, relid=3324614038, params=0x1f11bf0, va_cols=0xb59f63,
    bstrategy=0x1f1c620, isTopLevel=0 '\000') at vacuum.c:150
#6  0x0000000000852993 in standard_ProcessUtility (parsetree=0x1f07e60, queryString=0x1f07468 "VACUUM FREEZE;\n",
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0xea5cc0 <debugtupDR>, completionTag=0x7ffe0f866750 "") at utility.c:669
#7  0x00000000008520da in standard_ProcessUtility (parsetree=0x401ef6cd8, queryString=0x18 <error: Cannot access memory at address 0x18>,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x68, dest=0x9e5d62 <AllocSetFree+60>, completionTag=0x7ffe0f8663f0 "`~\360\001")
    at utility.c:360
#8  0x0000000000851161 in PortalRunMulti (portal=0x7ffe0f866750, isTopLevel=0 '\000', setHoldSnapshot=-39 '\331',
    dest=0x851161 <PortalRunMulti+19>, altdest=0x7ffe0f8664f0, completionTag=0x1f07e60 "\341\002") at pquery.c:1219
#9  0x0000000000851374 in PortalRunMulti (portal=0x1f0a488, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xea5cc0 <debugtupDR>,
    altdest=0xea5cc0 <debugtupDR>, completionTag=0x7ffe0f866750 "") at pquery.c:1345
#10 0x0000000000850889 in PortalRun (portal=0x1f0a488, count=9223372036854775807, isTopLevel=1 '\001', dest=0xea5cc0 <debugtupDR>,
    altdest=0xea5cc0 <debugtupDR>, completionTag=0x7ffe0f866750 "") at pquery.c:824
#11 0x000000000084a4dc in exec_simple_query (query_string=0x1f07468 "VACUUM FREEZE;\n") at postgres.c:1113
#12 0x000000000084e960 in PostgresMain (argc=10, argv=0x1e60a50, dbname=0x1e823b0 "template1", username=0x1e672a0 "anastasia")
    at postgres.c:4091
#13 0x00000000006f967e in init_locale (categoryname=0x100000000000000 <error: Cannot access memory at address 0x100000000000000>,
    category=32766, locale=0xa004692f0 <error: Cannot access memory at address 0xa004692f0>) at main.c:310
#14 0x00007f1e5f463830 in __libc_start_main (main=0x6f93e1 <main+85>, argc=10, argv=0x7ffe0f866a78, init=<optimized out>,
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe0f866a68) at ../csu/libc-start.c:291
#15 0x0000000000469319 in _start ()

core file is quite big, so I didn't attach it to the mail. You can download it here: core dump file.

Here are some notes about the first patch:

1. prefetchBlkno = blkno & ~0x1f;
    prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;

I didn't get what we need these tricks for. How does it differ from:
prefetchBlkno = (blkno > 32) ? blkno - 32 : 0;

2. Why do we decrease prefetchBlkno twice?

Here:
+    prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
And here:
if (prefetchBlkno >= 32)
+                prefetchBlkno -= 32;
   

I'll inspect second patch in a few days and write questions about it.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Anastasia Lubennikova wrote:

> I ran configure using following set of flags:
>  ./configure --enable-tap-tests --enable-cassert --enable-debug
> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
> And then ran make check. Here is the stacktrace:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> 1417                tblk =
> ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);

This doesn't make sense, since the patch removes the "tupindex"
variable in that function.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Dec 27, 2016 at 10:54 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Anastasia Lubennikova wrote:
>
>> I ran configure using following set of flags:
>>  ./configure --enable-tap-tests --enable-cassert --enable-debug
>> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
>> And then ran make check. Here is the stacktrace:
>>
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
>> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
>> 1417                tblk =
>> ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
>
> This doesn't make sense, since the patch removes the "tupindex"
> variable in that function.

The variable is still there. It just has a slightly different meaning
(index within the current segment, rather than global index).


On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 23.12.2016 22:54, Claudio Freire:
>
> On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>
> I found the reason. I configure postgres with CFLAGS="-O0" and it causes
> Segfault on initdb.
> It works fine and passes tests with default configure flags, but I'm pretty
> sure that we should fix segfault before testing the feature.
> If you need it, I'll send a core dump.
>
> I just ran it with CFLAGS="-O0" and it passes all checks too:
>
> CFLAGS='-O0' ./configure --enable-debug --enable-cassert
> make clean && make -j8 && make check-world
>
> A stacktrace and a thorough description of your build environment
> would be helpful to understand why it breaks on your system.
>
>
> I ran configure using following set of flags:
>  ./configure --enable-tap-tests --enable-cassert --enable-debug
> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
> And then ran make check. Here is the stacktrace:

Same procedure runs fine on my end.

> core file is quite big, so I didn't attach it to the mail. You can download it here: core dump file.

Can you attach your binary as well? (it needs to be identical to be
able to inspect the core dump, and quite clearly my build is
different)

I'll keep looking for ways it could crash there, but being unable to
reproduce the crash is a big hindrance, so if you can send the binary
that could help speed things up.


On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 1. prefetchBlkno = blkno & ~0x1f;
>     prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
>
> I didn't get it what for we need these tricks. How does it differ from:
> prefetchBlkno = (blkno > 32) ? blkno - 32 : 0;

It makes all prefetches cover ranges of 32 blocks aligned to 32-block boundaries.

It helps because it's at 32-block boundaries that the truncate stops to
check for lock conflicts and possibly abort, guaranteeing that no prefetch
will be needless (if the scan enters that code it *will* read the next 32
blocks).

> 2. Why do we decrease prefetchBlckno twice?
>
> Here:
> +    prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
> And here:
> if (prefetchBlkno >= 32)
> +                prefetchBlkno -= 32;

The first one is outside the loop; it finds the first prefetch range that
is boundary-aligned, taking care not to cause underflow.

The second one is inside the loop; it moves the prefetch window down as
the truncate moves along. Since the window is already aligned, it doesn't
need to be realigned, just clamped to zero if it happens to reach the
bottom.
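
As a rough sketch of the intended effect (simplified and not the patch's
exact code: this version tracks the lowest block already prefetched rather
than the window's lower bound, and the lock-conflict checks and actual page
reads are elided):

#define PREFETCH_SIZE   ((BlockNumber) 32)

BlockNumber blkno = vacrelstats->rel_pages;
BlockNumber prefetchedUntil = InvalidBlockNumber;

while (blkno > vacrelstats->nonempty_pages)
{
    blkno--;

    /*
     * Crossing into a new aligned 32-block range: prefetch all of it,
     * since the backward scan is guaranteed to read every block in it.
     */
    if (prefetchedUntil > blkno)
    {
        BlockNumber prefetchStart = blkno & ~(PREFETCH_SIZE - 1);
        BlockNumber pblkno;

        for (pblkno = prefetchStart; pblkno <= blkno; pblkno++)
            PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
        prefetchedUntil = prefetchStart;
    }

    /* ... read blkno and check whether the page is empty ... */
}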



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> 1417                tblk =
> ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
> (gdb) bt
> #0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> #1  0x0000000000693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9,
> vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')
>     at vacuumlazy.c:1337
> #2  0x0000000000691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9,
> params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
> #3  0x000000000069191f in vacuum_rel (relid=1247, relation=0x0, options=9,
> params=0x7ffe0f866310) at vacuum.c:1418


Those line numbers don't match my code.

Which commit are you based on?

My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Anastasia Lubennikova
Date:
27.12.2016 20:14, Claudio Freire:
On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
1417                tblk =
ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
(gdb) bt
#0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
#1  0x0000000000693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9,
vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')   at vacuumlazy.c:1337
#2  0x0000000000691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9,
params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
#3  0x000000000069191f in vacuum_rel (relid=1247, relation=0x0, options=9,
params=0x7ffe0f866310) at vacuum.c:1418
Those line numbers don't match my code.

Which commit are you based on?

My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610

Hm, my branch is based on 71f996d22125eb6cfdbee6094f44370aa8dec610 as well.
I merely applied patches 0001-Vacuum-prefetch-buffers-on-backward-scan-v3.patch
and 0002-Vacuum-allow-using-more-than-1GB-work-mem-v3.patch
then ran configure and make as usual.
Am I doing something wrong?

Anyway, I found the problem that caused the segfault.

for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++)
{
    DeadTuplesSegment *seg = &(vacrelstats->dead_tuples.dead_tuples[segindex]);
    int            num_dead_tuples = seg->num_dead_tuples;

    while (tupindex < num_dead_tuples)
    ...

You rely on the value of tupindex here, but during the very first pass the 'tupindex' variable
may contain garbage. It happened that on my system it held a negative value,
as I found by inspecting the core dump:

(gdb) info locals
num_dead_tuples = 5
tottuples = 0
tupindex = -1819017215

Which leads to a failure on the next line:
    tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);

The solution is to move this assignment inside the loop.
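
That is, something along these lines (a sketch of the fix only, reusing the
snippet above):

for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; segindex++)
{
    DeadTuplesSegment *seg = &(vacrelstats->dead_tuples.dead_tuples[segindex]);
    int            num_dead_tuples = seg->num_dead_tuples;

    tupindex = 0;        /* reset per segment, including the first one */
    while (tupindex < num_dead_tuples)
    ...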

I've read the second patch.

1. What is the reason to inline vac_cmp_itemptr() ?

2. +#define LAZY_MIN_TUPLES        Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
What does 128<<20 mean? Why not 1<<27? I'd ask you to replace it with a named constant,
or at least add a comment.

I'll share my results of performance testing it in a few days.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Dec 28, 2016 at 10:26 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 27.12.2016 20:14, Claudio Freire:
>
> On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> 1417                tblk =
> ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
> (gdb) bt
> #0  0x00000000006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> #1  0x0000000000693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9,
> vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')
>     at vacuumlazy.c:1337
> #2  0x0000000000691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9,
> params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
> #3  0x000000000069191f in vacuum_rel (relid=1247, relation=0x0, options=9,
> params=0x7ffe0f866310) at vacuum.c:1418
>
> Those line numbers don't match my code.
>
> Which commit are you based on?
>
> My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610
>
>
> Hm, my branch is based on 71f996d22125eb6cfdbee6094f44370aa8dec610 as well.
> I merely applied patches
> 0001-Vacuum-prefetch-buffers-on-backward-scan-v3.patch
> and 0002-Vacuum-allow-using-more-than-1GB-work-mem-v3.patch
> then ran configure and make as usual.
> Am I doing something wrong?

Doesn't sound wrong. Maybe it's my tree that's the unclean one. I'll have to
do a clean checkout to verify.

> Anyway, I found the problem that had caused segfault.
>
> for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex =
> 0, segindex++)
> {
>     DeadTuplesSegment *seg =
> &(vacrelstats->dead_tuples.dead_tuples[segindex]);
>     int            num_dead_tuples = seg->num_dead_tuples;
>
>     while (tupindex < num_dead_tuples)
>     ...
>
> You rely on the value of tupindex here, while during the very first pass the
> 'tupindex' variable
> may contain any garbage. And it happend that on my system there was negative
> value
> as I found inspecting core dump:
>
> (gdb) info locals
> num_dead_tuples = 5
> tottuples = 0
> tupindex = -1819017215
>
> Which leads to failure in the next line
>     tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
>
> The solution is to move this assignment inside the cycle.

Good catch. I read that line suspecting that very same thing but
somehow I was blind to it.

> I've read the second patch.
>
> 1. What is the reason to inline vac_cmp_itemptr() ?

Performance, mostly. By inlining it, the compiler can apply transformations
that wouldn't be possible otherwise, and during the binary search that
matters performance-wise.

> 2. +#define LAZY_MIN_TUPLES        Max(MaxHeapTuplesPerPage, (128<<20) /
> sizeof(ItemPointerData))
> What does 128<<20 mean? Why not 1<<27? I'd ask you to replace it with named
> constant,
> or at least add a comment.

I thought it was more readable like that: since 1<<20 is known to be
"MB", it reads as "128 MB".

I guess I can add a comment saying so.
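
For example (the comment wording here is just a suggestion, not necessarily
what will end up in the patch):

/* 128<<20 is 128MB: allocate at least ~128MB worth of TIDs, but never less
 * than one heap page's worth of line pointers */
#define LAZY_MIN_TUPLES        Max(MaxHeapTuplesPerPage, (128 << 20) / sizeof(ItemPointerData))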

> I'll share my results of performance testing it in a few days.

Thanks



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Dec 28, 2016 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Anyway, I found the problem that had caused segfault.
>>
>> for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex =
>> 0, segindex++)
>> {
>>     DeadTuplesSegment *seg =
>> &(vacrelstats->dead_tuples.dead_tuples[segindex]);
>>     int            num_dead_tuples = seg->num_dead_tuples;
>>
>>     while (tupindex < num_dead_tuples)
>>     ...
>>
>> You rely on the value of tupindex here, while during the very first pass the
>> 'tupindex' variable
>> may contain any garbage. And it happend that on my system there was negative
>> value
>> as I found inspecting core dump:
>>
>> (gdb) info locals
>> num_dead_tuples = 5
>> tottuples = 0
>> tupindex = -1819017215
>>
>> Which leads to failure in the next line
>>     tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
>>
>> The solution is to move this assignment inside the cycle.
>
> Good catch. I read that line suspecting that very same thing but
> somehow I was blind to it.

Attached v4 patches with the requested fixes.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Anastasia Lubennikova
Date:
28.12.2016 23:43, Claudio Freire:
Attached v4 patches with the requested fixes.

Sorry for being late, but the tests took a lot of time.
create table t1 as select i, md5(random()::text) from generate_series(0,400000000) as i;
create index md5_idx ON  t1(md5);
update t1 set md5 = md5((random() * (100 + 500))::text);
vacuum;
The patched vacuum used 2.9GB of memory and vacuumed the index in one pass,
while the old version took three passes (1GB+1GB+0.9GB).
Vacuum duration results:
vanilla:
LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
patched:
LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;

We can see a 30% vacuum speedup. I should note that this case can be considered
favorable to vanilla vacuum: the table is not that big, it has just one index,
and the disk used is a fast fusionIO. We can expect even more gain on slower disks.

Thank you again for the patch. Hope to see it in 10.0.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 28.12.2016 23:43, Claudio Freire:
>
> Attached v4 patches with the requested fixes.
>
>
> Sorry for being late, but the tests took a lot of time.

I know. Takes me several days to run my test scripts once.

> create table t1 as select i, md5(random()::text) from
> generate_series(0,400000000) as i;
> create index md5_idx ON  t1(md5);
> update t1 set md5 = md5((random() * (100 + 500))::text);
> vacuum;
>
> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
> while for old version it took three passes (1GB+1GB+0.9GB).
> Vacuum duration results:
>
> vanilla:
> LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
> patched:
> LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;
>
> We can see 30% vacuum speedup. I should note that this case can be
> considered
> as favorable to vanilla vacuum: the table is not that big, it has just one
> index
> and disk used is a fast fusionIO. We can expect even more gain on slower
> disks.
>
> Thank you again for the patch. Hope to see it in 10.0.

Cool. Thanks for the review and the tests.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> 28.12.2016 23:43, Claudio Freire:
>>
>> Attached v4 patches with the requested fixes.
>>
>>
>> Sorry for being late, but the tests took a lot of time.
>
> I know. Takes me several days to run my test scripts once.
>
>> create table t1 as select i, md5(random()::text) from
>> generate_series(0,400000000) as i;
>> create index md5_idx ON  t1(md5);
>> update t1 set md5 = md5((random() * (100 + 500))::text);
>> vacuum;
>>
>> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
>> while for old version it took three passes (1GB+1GB+0.9GB).
>> Vacuum duration results:
>>
>> vanilla:
>> LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
>> patched:
>> LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;
>>
>> We can see 30% vacuum speedup. I should note that this case can be
>> considered
>> as favorable to vanilla vacuum: the table is not that big, it has just one
>> index
>> and disk used is a fast fusionIO. We can expect even more gain on slower
>> disks.
>>
>> Thank you again for the patch. Hope to see it in 10.0.
>
> Cool. Thanks for the review and the tests.
>

I encountered a bug with the following scenario.
1. Create table and disable autovacuum on that table.
2. Make about 200000 dead tuples on the table.
3. SET maintenance_work_mem TO 1024
4. VACUUM

@@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                         * not to reset latestRemovedXid since we want that value to be
                         * valid.
                         */
-                       vacrelstats->num_dead_tuples = 0;
+                       lazy_clear_dead_tuples(vacrelstats);
                        vacrelstats->num_index_scans++;

                        /* Report that we are once again scanning the heap */

I think that we should do vacrelstats->dead_tuples.num_entries = 0 as
well in lazy_clear_dead_tuples(). Once the amount of dead tuples
reaches maintenance_work_mem, lazy_scan_heap can never finish.
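
Something like this, I think (just a sketch; the real function may need to
do more than reset the counters):

static void
lazy_clear_dead_tuples(LVRelStats *vacrelstats)
{
    int         i;

    for (i = 0; i <= vacrelstats->dead_tuples.last_seg; i++)
        vacrelstats->dead_tuples.dead_tuples[i].num_dead_tuples = 0;

    vacrelstats->dead_tuples.last_seg = 0;
    vacrelstats->dead_tuples.num_entries = 0;    /* the missing reset */
}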

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
You posted two patches with this preamble:

Claudio Freire wrote:

> Attached is the raw output of the test, the script used to create it,
> and just in case the patch set used. I believe it's the same as the
> last one I posted, just rebased.

There was no discussion whatsoever of the "prefetch" patch in this
thread; and as far as I can see, nobody even mentioned such an idea in
the thread.  This prefetch patch appeared out of the blue and there was
no discussion about it that I can see.  Now I was about to push it after
some minor tweaks, and went to search where was its justification, only
to see that there was none.  Did anybody run tests with this patch?

I attach it now one more time.  My version is based on the latest
Claudio posted at
https://postgr.es/m/CAGTBQpa464RugxYwxLTtDi=Syv9GnGFcJK8uZb2fR6NDDqULaw@mail.gmail.com
I don't know if there are differences to the version first posted.
I only changed the magic number 32 to a #define, and added a
CHECK_FOR_INTERRUPTS in the prefetching loop.

FWIW, I think this patch is completely separate from the maint_work_mem
patch and should have had its own thread and its own commitfest entry.
I intend to get a look at the other patch next week, after pushing this
one.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> There was no discussion whatsoever of the "prefetch" patch in this
> thread; and as far as I can see, nobody even mentioned such an idea in
> the thread.  This prefetch patch appeared out of the blue and there was
> no discussion about it that I can see.  Now I was about to push it after
> some minor tweaks, and went to search where was its justification, only
> to see that there was none.  Did anybody run tests with this patch?
> 
> I attach it now one more time.  My version is based on the latest
> Claudio posted at
> https://postgr.es/m/CAGTBQpa464RugxYwxLTtDi=Syv9GnGFcJK8uZb2fR6NDDqULaw@mail.gmail.com
> I don't know if there are differences to the version first posted.
> I only changed the magic number 32 to a #define, and added a
> CHECK_FOR_INTERRUPTS in the prefetching loop.

Ah, I found the justification here:

https://www.postgresql.org/message-id/flat/CAGTBQpa464RugxYwxLTtDi%3DSyv9GnGFcJK8uZb2fR6NDDqULaw%40mail.gmail.com#CAGTBQpbayY-t5-ySW19yQs1dBqvV6dm8dmdpTv_FWXmDC0A0cQ%40mail.gmail.com
apparently the truncate scan is 4x-6x faster with this prefetching.
Nice!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
I pushed this patch after rewriting it rather completely.  I added
tracing notices to inspect the blocks it was prefetching and observed
that the original coding was failing to prefetch the final streak of
blocks in the table, which is an important oversight considering that it
may very well be that those are the only blocks to read at all.

I timed vacuuming a 4000-block table in my laptop (single SSD disk;
dropped FS caches after deleting all rows in table, so that vacuum has
to read all blocks from disk); it changes from 387ms without patch to
155ms with patch.  I didn't measure how much it takes to run the other
steps in the vacuum, but it's clear that the speedup for the truncation
phase is considerable.

¡Thanks, Claudio!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
I think this patch no longer applies because of conflicts with the one I
just pushed.  Please rebase.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Jan 20, 2017 at 6:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
>> <a.lubennikova@postgrespro.ru> wrote:
>>> 28.12.2016 23:43, Claudio Freire:
>>>
>>> Attached v4 patches with the requested fixes.
>>>
>>>
>>> Sorry for being late, but the tests took a lot of time.
>>
>> I know. Takes me several days to run my test scripts once.
>>
>>> create table t1 as select i, md5(random()::text) from
>>> generate_series(0,400000000) as i;
>>> create index md5_idx ON  t1(md5);
>>> update t1 set md5 = md5((random() * (100 + 500))::text);
>>> vacuum;
>>>
>>> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
>>> while for old version it took three passes (1GB+1GB+0.9GB).
>>> Vacuum duration results:
>>>
>>> vanilla:
>>> LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
>>> patched:
>>> LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;
>>>
>>> We can see 30% vacuum speedup. I should note that this case can be
>>> considered
>>> as favorable to vanilla vacuum: the table is not that big, it has just one
>>> index
>>> and disk used is a fast fusionIO. We can expect even more gain on slower
>>> disks.
>>>
>>> Thank you again for the patch. Hope to see it in 10.0.
>>
>> Cool. Thanks for the review and the tests.
>>
>
> I encountered a bug with following scenario.
> 1. Create table and disable autovacuum on that table.
> 2. Make about 200000 dead tuples on the table.
> 3. SET maintenance_work_mem TO 1024
> 4. VACUUM
>
> @@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options,
> LVRelStats *vacrelstats,
>                          * not to reset latestRemovedXid since we want
> that value to be
>                          * valid.
>                          */
> -                       vacrelstats->num_dead_tuples = 0;
> +                       lazy_clear_dead_tuples(vacrelstats);
>                         vacrelstats->num_index_scans++;
>
>                         /* Report that we are once again scanning the heap */
>
> I think that we should do vacrelstats->dead_tuples.num_entries = 0 as
> well in lazy_clear_dead_tuples(). Once the amount of dead tuples
> reached to maintenance_work_mem, lazy_scan_heap can never finish.

That's right.

I added a test for it in the attached patch set, which uncovered
another bug in lazy_clear_dead_tuples, and took the opportunity to
rebase.

On Mon, Jan 23, 2017 at 1:06 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I pushed this patch after rewriting it rather completely.  I added
> tracing notices to inspect the blocks it was prefetching and observed
> that the original coding was failing to prefetch the final streak of
> blocks in the table, which is an important oversight considering that it
> may very well be that those are the only blocks to read at all.
>
> I timed vacuuming a 4000-block table in my laptop (single SSD disk;
> dropped FS caches after deleting all rows in table, so that vacuum has
> to read all blocks from disk); it changes from 387ms without patch to
> 155ms with patch.  I didn't measure how much it takes to run the other
> steps in the vacuum, but it's clear that the speedup for the truncation
> phase is considerable.
>
> ¡Thanks, Claudio!

Cool.

Though it wasn't the first time this idea has been floating around, I
can't take all the credit.


On Fri, Jan 20, 2017 at 6:25 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> FWIW, I think this patch is completely separate from the maint_work_mem
> patch and should have had its own thread and its own commitfest entry.
> I intend to get a look at the other patch next week, after pushing this
> one.

That's because it did have one, but it was left in limbo due to lack of
testing on SSDs. I just had to adopt it here because otherwise the tests
took way too long.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Tue, Jan 24, 2017 at 1:49 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Jan 20, 2017 at 6:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
>>> <a.lubennikova@postgrespro.ru> wrote:
>>>> 28.12.2016 23:43, Claudio Freire:
>>>>
>>>> Attached v4 patches with the requested fixes.
>>>>
>>>>
>>>> Sorry for being late, but the tests took a lot of time.
>>>
>>> I know. Takes me several days to run my test scripts once.
>>>
>>>> create table t1 as select i, md5(random()::text) from
>>>> generate_series(0,400000000) as i;
>>>> create index md5_idx ON  t1(md5);
>>>> update t1 set md5 = md5((random() * (100 + 500))::text);
>>>> vacuum;
>>>>
>>>> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
>>>> while for old version it took three passes (1GB+1GB+0.9GB).
>>>> Vacuum duration results:
>>>>
>>>> vanilla:
>>>> LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
>>>> patched:
>>>> LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;
>>>>
>>>> We can see 30% vacuum speedup. I should note that this case can be
>>>> considered
>>>> as favorable to vanilla vacuum: the table is not that big, it has just one
>>>> index
>>>> and disk used is a fast fusionIO. We can expect even more gain on slower
>>>> disks.
>>>>
>>>> Thank you again for the patch. Hope to see it in 10.0.
>>>
>>> Cool. Thanks for the review and the tests.
>>>
>>
>> I encountered a bug with following scenario.
>> 1. Create table and disable autovacuum on that table.
>> 2. Make about 200000 dead tuples on the table.
>> 3. SET maintenance_work_mem TO 1024
>> 4. VACUUM
>>
>> @@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options,
>> LVRelStats *vacrelstats,
>>                          * not to reset latestRemovedXid since we want
>> that value to be
>>                          * valid.
>>                          */
>> -                       vacrelstats->num_dead_tuples = 0;
>> +                       lazy_clear_dead_tuples(vacrelstats);
>>                         vacrelstats->num_index_scans++;
>>
>>                         /* Report that we are once again scanning the heap */
>>
>> I think that we should do vacrelstats->dead_tuples.num_entries = 0 as
>> well in lazy_clear_dead_tuples(). Once the amount of dead tuples
>> reached to maintenance_work_mem, lazy_scan_heap can never finish.
>
> That's right.
>
> I added a test for it in the attached patch set, which uncovered
> another bug in lazy_clear_dead_tuples, and took the opportunity to
> rebase.
>
> On Mon, Jan 23, 2017 at 1:06 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> I pushed this patch after rewriting it rather completely.  I added
>> tracing notices to inspect the blocks it was prefetching and observed
>> that the original coding was failing to prefetch the final streak of
>> blocks in the table, which is an important oversight considering that it
>> may very well be that those are the only blocks to read at all.
>>
>> I timed vacuuming a 4000-block table in my laptop (single SSD disk;
>> dropped FS caches after deleting all rows in table, so that vacuum has
>> to read all blocks from disk); it changes from 387ms without patch to
>> 155ms with patch.  I didn't measure how much it takes to run the other
>> steps in the vacuum, but it's clear that the speedup for the truncation
>> phase is considerable.
>>
>> ¡Thanks, Claudio!
>
> Cool.
>
> Though it wasn't the first time this idea has been floating around, I
> can't take all the credit.
>
>
> On Fri, Jan 20, 2017 at 6:25 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> FWIW, I think this patch is completely separate from the maint_work_mem
>> patch and should have had its own thread and its own commitfest entry.
>> I intend to get a look at the other patch next week, after pushing this
>> one.
>
> That's because it did have it, and was left in limbo due to lack of
> testing on SSDs. I just had to adopt it here because otherwise tests
> took way too long.

Thank you for updating the patch!

+       /*
+        * Quickly rule out by lower bound (should happen a lot) Upper bound was
+        * already checked by segment search
+        */
+       if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 0)
+               return false;

I think that if the above result is 0, we can return true, as itemptr
matches the lower-bound item pointer in rseg->dead_tuples.

+typedef struct DeadTuplesSegment
+{
+       int                     num_dead_tuples;        /* # of entries in the segment */
+       int                     max_dead_tuples;        /* # of entries allocated in the segment */
+       ItemPointerData last_dead_tuple;        /* Copy of the last dead tuple (unset
+                                                * until the segment is fully
+                                                * populated) */
+       unsigned short padding;
+       ItemPointer dead_tuples;        /* Array of dead tuples */
+}      DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+       int                     num_entries;    /* current # of entries */
+       int                     max_entries;    /* total # of slots that can be allocated in
+                                                * array */
+       int                     num_segs;               /* number of dead tuple segments allocated */
+       int                     last_seg;               /* last dead tuple segment with data (or 0) */
+       DeadTuplesSegment *dead_tuples;         /* array of num_segs segments */
+}      DeadTuplesMultiArray;

It's a matter of personal preference, but having several variables all
named dead_tuples with different meanings confused me.
If we want to access the first dead tuple location of the first segment, we
need to write 'vacrelstats->dead_tuples.dead_tuples.dead_tuples'. Something
like 'vacrelstats->dead_tuples.dt_segment.dt_array' would be better to me.


+                                       nseg->num_dead_tuples = 0;
+                                       nseg->max_dead_tuples = 0;
+                                       nseg->dead_tuples = NULL;
+                                       vacrelstats->dead_tuples.num_segs++;
+                               }
+                               seg = DeadTuplesCurrentSegment(vacrelstats);
+                       }
+                       vacrelstats->dead_tuples.last_seg++;
+                       seg = DeadTuplesCurrentSegment(vacrelstats);

Because seg is always set later, I think the first line starting with
"seg = ..." is not necessary. Thoughts?

--
Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Jan 25, 2017 at 1:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patch!
>
> +       /*
> +        * Quickly rule out by lower bound (should happen a lot) Upper bound was
> +        * already checked by segment search
> +        */
> +       if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 0)
> +               return false;
>
> I think that if the above result is 0, we can return true as itemptr
> matched lower bound item pointer in rseg->dead_tuples.

That's right. Possibly not a great speedup but... why not?
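
Something like this, then (sketch only, reusing the names from the quoted
snippet):

int         cmp = vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples);

if (cmp < 0)
    return false;       /* below this segment's lower bound */
if (cmp == 0)
    return true;        /* itemptr is exactly the lower bound */
/* otherwise fall through to the binary search within the segment */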

>
>  +typedef struct DeadTuplesSegment
> +{
> +       int                     num_dead_tuples;        /* # of
> entries in the segment */
> +       int                     max_dead_tuples;        /* # of
> entries allocated in the segment */
> +       ItemPointerData last_dead_tuple;        /* Copy of the last
> dead tuple (unset
> +
>           * until the segment is fully
> +
>           * populated) */
> +       unsigned short padding;
> +       ItemPointer dead_tuples;        /* Array of dead tuples */
> +}      DeadTuplesSegment;
> +
> +typedef struct DeadTuplesMultiArray
> +{
> +       int                     num_entries;    /* current # of entries */
> +       int                     max_entries;    /* total # of slots
> that can be allocated in
> +                                                                * array */
> +       int                     num_segs;               /* number of
> dead tuple segments allocated */
> +       int                     last_seg;               /* last dead
> tuple segment with data (or 0) */
> +       DeadTuplesSegment *dead_tuples;         /* array of num_segs segments */
> +}      DeadTuplesMultiArray;
>
> It's a matter of personal preference but some same dead_tuples
> variables having different meaning confused me.
> If we want to access first dead tuple location of first segment, we
> need to do 'vacrelstats->dead_tuples.dead_tuples.dead_tuples'. For
> example, 'vacrelstats->dead_tuples.dt_segment.dt_array' is better to
> me.

Yes, I can see how that could be confusing.

I went for vacrelstats->dead_tuples.dt_segments[i].dt_tids[j]

> +                                       nseg->num_dead_tuples = 0;
> +                                       nseg->max_dead_tuples = 0;
> +                                       nseg->dead_tuples = NULL;
> +                                       vacrelstats->dead_tuples.num_segs++;
> +                               }
> +                               seg = DeadTuplesCurrentSegment(vacrelstats);
> +                       }
> +                       vacrelstats->dead_tuples.last_seg++;
> +                       seg = DeadTuplesCurrentSegment(vacrelstats);
>
> Because seg is always set later I think the first line starting with
> "seg = ..." is not necessary. Thought?

That's correct.

Attached a v6 with those changes (and rebased).

Make check still passes.


Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Thu, Jan 26, 2017 at 5:11 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Jan 25, 2017 at 1:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Thank you for updating the patch!
>>
>> +       /*
>> +        * Quickly rule out by lower bound (should happen a lot) Upper bound was
>> +        * already checked by segment search
>> +        */
>> +       if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 0)
>> +               return false;
>>
>> I think that if the above result is 0, we can return true as itemptr
>> matched lower bound item pointer in rseg->dead_tuples.
>
> That's right. Possibly not a great speedup but... why not?
>
>>
>>  +typedef struct DeadTuplesSegment
>> +{
>> +       int                     num_dead_tuples;        /* # of entries in the segment */
>> +       int                     max_dead_tuples;        /* # of entries allocated in the segment */
>> +       ItemPointerData last_dead_tuple;        /* Copy of the last dead tuple (unset
>> +                                                * until the segment is fully
>> +                                                * populated) */
>> +       unsigned short padding;
>> +       ItemPointer dead_tuples;        /* Array of dead tuples */
>> +}      DeadTuplesSegment;
>> +
>> +typedef struct DeadTuplesMultiArray
>> +{
>> +       int                     num_entries;    /* current # of entries */
>> +       int                     max_entries;    /* total # of slots that can be
>> +                                                * allocated in array */
>> +       int                     num_segs;               /* number of dead tuple segments allocated */
>> +       int                     last_seg;               /* last dead tuple segment with data (or 0) */
>> +       DeadTuplesSegment *dead_tuples;         /* array of num_segs segments */
>> +}      DeadTuplesMultiArray;
>>
>> It's a matter of personal preference, but reusing the same dead_tuples
>> name for variables with different meanings confused me.
>> If we want to access the first dead tuple location of the first segment,
>> we need to do 'vacrelstats->dead_tuples.dead_tuples.dead_tuples'.
>> Something like 'vacrelstats->dead_tuples.dt_segment.dt_array' would be
>> clearer to me.
>
> Yes, I can see how that could be confusing.
>
> I went for vacrelstats->dead_tuples.dt_segments[i].dt_tids[j]

Thank you for updating.
Looks good to me.

>> +                                       nseg->num_dead_tuples = 0;
>> +                                       nseg->max_dead_tuples = 0;
>> +                                       nseg->dead_tuples = NULL;
>> +                                       vacrelstats->dead_tuples.num_segs++;
>> +                               }
>> +                               seg = DeadTuplesCurrentSegment(vacrelstats);
>> +                       }
>> +                       vacrelstats->dead_tuples.last_seg++;
>> +                       seg = DeadTuplesCurrentSegment(vacrelstats);
>>
>> Because seg is always set later I think the first line starting with
>> "seg = ..." is not necessary. Thought?
>
> That's correct.
>
> Attached a v6 with those changes (and rebased).
>
> Make check still passes.

Here is review comment of v6 patch.

----
 * We are willing to use at most maintenance_work_mem (or perhaps
 * autovacuum_work_mem) memory space to keep track of dead tuples.  We
 * initially allocate an array of TIDs of that size, with an upper limit that
 * depends on table size (this limit ensures we don't allocate a huge area
 * uselessly for vacuuming small tables).  If the array threatens to overflow,

I think that we need to update the above paragraph comment at top of
vacuumlazy.c file.

----
+                               numtuples = Max(numtuples, MaxHeapTuplesPerPage);
+                               numtuples = Min(numtuples, INT_MAX / 2);
+                               numtuples = Min(numtuples, 2 * pseg->max_dead_tuples);
+                               numtuples = Min(numtuples, MaxAllocSize / sizeof(ItemPointerData));
+                               seg->dt_tids = (ItemPointer) palloc(sizeof(ItemPointerData) * numtuples);

Why is numtuples limited to "INT_MAX / 2" rather than INT_MAX?

----
@@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
        pg_rusage_init(&ru0);
        npages = 0;

-       tupindex = 0;
-       while (tupindex < vacrelstats->num_dead_tuples)
+       segindex = 0;
+       tottuples = 0;
+       for (segindex = tupindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++)
        {
-               BlockNumber tblk;
-               Buffer          buf;
-               Page            page;
-               Size            freespace;

This is a minute thing, but tupindex can be defined inside the for loop.

----
@@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
          * instead of doing a second scan.
          */
         if (nindexes == 0 &&
-            vacrelstats->num_dead_tuples > 0)
+            vacrelstats->dead_tuples.num_entries > 0)
         {
             /* Remove tuples from heap */
-            lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+            Assert(vacrelstats->dead_tuples.last_seg == 0);     /* Should not need more
+                                                                 * than one segment per
+                                                                 * page */

I'm not sure we need to add Assert() here, but it seems to me that the
comment and code don't properly correspond, and the comment for the
Assert() should be written above the Assert() line.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Mon, Jan 30, 2017 at 5:51 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> ----
>  * We are willing to use at most maintenance_work_mem (or perhaps
>  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>  * initially allocate an array of TIDs of that size, with an upper limit that
>  * depends on table size (this limit ensures we don't allocate a huge area
>  * uselessly for vacuuming small tables).  If the array threatens to overflow,
>
> I think that we need to update the above paragraph comment at top of
> vacuumlazy.c file.

Indeed, I missed that one. Fixing.

>
> ----
> +                               numtuples = Max(numtuples, MaxHeapTuplesPerPage);
> +                               numtuples = Min(numtuples, INT_MAX / 2);
> +                               numtuples = Min(numtuples, 2 * pseg->max_dead_tuples);
> +                               numtuples = Min(numtuples, MaxAllocSize / sizeof(ItemPointerData));
> +                               seg->dt_tids = (ItemPointer) palloc(sizeof(ItemPointerData) * numtuples);
>
> Why is numtuples limited to "INT_MAX / 2" rather than INT_MAX?

I forgot to mention this one in the OP.

Googling around, I found that some implementations of bsearch break with
array sizes beyond INT_MAX/2 [1] (they'd overflow when computing the
midpoint).

Before this patch, this bsearch call had no way of reaching that size.
An initial version of the patch (the one that allocated a big array
with huge allocation) could reach that point, though, so I reduced the
limit to play it safe. This latest version is back to the starting
point, since it cannot allocate segments bigger than 1GB, but I opted
to keep playing it safe and leave the reduced limit just in case.
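
For illustration only (this snippet is neither from the patch nor from any
libc), the hazard is in how a bsearch implementation computes its midpoint
when the indexes are plain ints:

/* Midpoint of [lo, hi], two ways. */
static int
midpoint_unsafe(int lo, int hi)
{
    return (lo + hi) / 2;       /* lo + hi overflows once hi > INT_MAX - lo */
}

static int
midpoint_safe(int lo, int hi)
{
    return lo + (hi - lo) / 2;  /* stays in range for any 0 <= lo <= hi */
}

Capping the array at INT_MAX/2 entries guarantees lo + hi can never exceed
INT_MAX, so even the unsafe form is harmless.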

> ----
> @@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats
> *vacrelstats)
>         pg_rusage_init(&ru0);
>         npages = 0;
>
> -       tupindex = 0;
> -       while (tupindex < vacrelstats->num_dead_tuples)
> +       segindex = 0;
> +       tottuples = 0;
> +       for (segindex = tupindex = 0; segindex <=
> vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++)
>         {
> -               BlockNumber tblk;
> -               Buffer          buf;
> -               Page            page;
> -               Size            freespace;
>
> This is a minute thing, but tupindex can be defined inside the for loop.

Right, changing.

>
> ----
> @@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options,
> LVRelStats *vacrelstats,
>           * instead of doing a second scan.
>           */
>          if (nindexes == 0 &&
> -            vacrelstats->num_dead_tuples > 0)
> +            vacrelstats->dead_tuples.num_entries > 0)
>          {
>              /* Remove tuples from heap */
> -            lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
> +            Assert(vacrelstats->dead_tuples.last_seg == 0);        /*
> Should not need more
> +                                                                 *
> than one segment per
> +                                                                 * page */
>
> I'm not sure we need to add Assert() here, but it seems to me that the
> comment and code don't properly correspond, and the comment for the
> Assert() should be written above the Assert() line.

Well, that assert is the one that found the second bug in
lazy_clear_dead_tuples, so clearly it's not without merit.

I'll rearrange the comments as you ask though.


Updated and rebased v7 attached.


[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=776671


Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Michael Paquier
Date:
On Tue, Jan 31, 2017 at 11:05 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Updated and rebased v7 attached.

Moved to CF 2017-03.
-- 
Michael



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Tue, Jan 31, 2017 at 3:05 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Jan 30, 2017 at 5:51 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> ----
>>  * We are willing to use at most maintenance_work_mem (or perhaps
>>  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>>  * initially allocate an array of TIDs of that size, with an upper limit that
>>  * depends on table size (this limit ensures we don't allocate a huge area
>>  * uselessly for vacuuming small tables).  If the array threatens to overflow,
>>
>> I think that we need to update the above paragraph comment at top of
>> vacuumlazy.c file.
>
> Indeed, I missed that one. Fixing.
>
>>
>> ----
>> +                               numtuples = Max(numtuples, MaxHeapTuplesPerPage);
>> +                               numtuples = Min(numtuples, INT_MAX / 2);
>> +                               numtuples = Min(numtuples, 2 * pseg->max_dead_tuples);
>> +                               numtuples = Min(numtuples, MaxAllocSize / sizeof(ItemPointerData));
>> +                               seg->dt_tids = (ItemPointer) palloc(sizeof(ItemPointerData) * numtuples);
>>
>> Why is numtuples limited to "INT_MAX / 2" rather than INT_MAX?
>
> I forgot to mention this one in the OP.
>
> Googling around, I found that some implementations of bsearch break with
> array sizes beyond INT_MAX/2 [1] (they'd overflow when computing the
> midpoint).
>
> Before this patch, this bsearch call had no way of reaching that size.
> An initial version of the patch (the one that allocated a big array
> with huge allocation) could reach that point, though, so I reduced the
> limit to play it safe. This latest version is back to the starting
> point, since it cannot allocate segments bigger than 1GB, but I opted
> to keep playing it safe and leave the reduced limit just in case.
>

Thanks, I understood.

>> ----
>> @@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats
>> *vacrelstats)
>>         pg_rusage_init(&ru0);
>>         npages = 0;
>>
>> -       tupindex = 0;
>> -       while (tupindex < vacrelstats->num_dead_tuples)
>> +       segindex = 0;
>> +       tottuples = 0;
>> +       for (segindex = tupindex = 0; segindex <=
>> vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++)
>>         {
>> -               BlockNumber tblk;
>> -               Buffer          buf;
>> -               Page            page;
>> -               Size            freespace;
>>
>> This is a minute thing, but tupindex can be defined inside the for loop.
>
> Right, changing.
>
>>
>> ----
>> @@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options,
>> LVRelStats *vacrelstats,
>>           * instead of doing a second scan.
>>           */
>>          if (nindexes == 0 &&
>> -            vacrelstats->num_dead_tuples > 0)
>> +            vacrelstats->dead_tuples.num_entries > 0)
>>          {
>>              /* Remove tuples from heap */
>> -            lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
>> +            Assert(vacrelstats->dead_tuples.last_seg == 0);        /*
>> Should not need more
>> +                                                                 *
>> than one segment per
>> +                                                                 * page */
>>
>> I'm not sure we need to add Assert() here, but it seems to me that the
>> comment and code don't properly correspond, and the comment for the
>> Assert() should be written above the Assert() line.
>
> Well, that assert is the one that found the second bug in
> lazy_clear_dead_tuples, so clearly it's not without merit.
>
> I'll rearrange the comments as you ask though.
>
>
> Updated and rebased v7 attached.
>
>
> [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=776671

Thank you for updating the patch.

Whole patch looks good to me except for the following one comment.
This is the final comment from me.

/*
 *  lazy_tid_reaped() -- is a particular tid deletable?
 *
 *      This has the right signature to be an IndexBulkDeleteCallback.
 *
 *      Assumes dead_tuples array is in sorted order.
 */
static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
    LVRelStats *vacrelstats = (LVRelStats *) state;

You might want to update the comment of lazy_tid_reaped() as well.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patch.
>
> Whole patch looks good to me except for the following one comment.
> This is the final comment from me.
>
> /*
>  *  lazy_tid_reaped() -- is a particular tid deletable?
>  *
>  *      This has the right signature to be an IndexBulkDeleteCallback.
>  *
>  *      Assumes dead_tuples array is in sorted order.
>  */
> static bool
> lazy_tid_reaped(ItemPointer itemptr, void *state)
> {
>     LVRelStats *vacrelstats = (LVRelStats *) state;
>
> You might want to update the comment of lazy_tid_reaped() as well.

I don't see the mismatch with reality there (if you consider
"dead_tples array" in the proper context, that is, the multiarray).

What in particular do you find out of sync there?



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Thank you for updating the patch.
>>
>> Whole patch looks good to me except for the following one comment.
>> This is the final comment from me.
>>
>> /*
>>  *  lazy_tid_reaped() -- is a particular tid deletable?
>>  *
>>  *      This has the right signature to be an IndexBulkDeleteCallback.
>>  *
>>  *      Assumes dead_tuples array is in sorted order.
>>  */
>> static bool
>> lazy_tid_reaped(ItemPointer itemptr, void *state)
>> {
>>     LVRelStats *vacrelstats = (LVRelStats *) state;
>>
>> You might want to update the comment of lazy_tid_reaped() as well.
>
> I don't see the mismatch with reality there (if you consider
> "dead_tples array" in the proper context, that is, the multiarray).
>
> What in particular do you find out of sync there?

The current lazy_tid_reaped just finds a tid in a single tid array using
bsearch, but in your patch lazy_tid_reaped handles multiple tid arrays
and the processing becomes more complicated. So I thought it's better to
add a description to this function.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Thank you for updating the patch.
>>>
>>> Whole patch looks good to me except for the following one comment.
>>> This is the final comment from me.
>>>
>>> /*
>>>  *  lazy_tid_reaped() -- is a particular tid deletable?
>>>  *
>>>  *      This has the right signature to be an IndexBulkDeleteCallback.
>>>  *
>>>  *      Assumes dead_tuples array is in sorted order.
>>>  */
>>> static bool
>>> lazy_tid_reaped(ItemPointer itemptr, void *state)
>>> {
>>>     LVRelStats *vacrelstats = (LVRelStats *) state;
>>>
>>> You might want to update the comment of lazy_tid_reaped() as well.
>>
>> I don't see the mismatch with reality there (if you consider
>> "dead_tples array" in the proper context, that is, the multiarray).
>>
>> What in particular do you find out of sync there?
>
>> The current lazy_tid_reaped just finds a tid in a single tid array using
>> bsearch, but in your patch lazy_tid_reaped handles multiple tid arrays
>> and the processing becomes more complicated. So I thought it's better to
>> add a description to this function.

Alright, updated with some more remarks that seemed relevant


Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Wed, Feb 1, 2017 at 11:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> Thank you for updating the patch.
>>>>
>>>> Whole patch looks good to me except for the following one comment.
>>>> This is the final comment from me.
>>>>
>>>> /*
>>>>  *  lazy_tid_reaped() -- is a particular tid deletable?
>>>>  *
>>>>  *      This has the right signature to be an IndexBulkDeleteCallback.
>>>>  *
>>>>  *      Assumes dead_tuples array is in sorted order.
>>>>  */
>>>> static bool
>>>> lazy_tid_reaped(ItemPointer itemptr, void *state)
>>>> {
>>>>     LVRelStats *vacrelstats = (LVRelStats *) state;
>>>>
>>>> You might want to update the comment of lazy_tid_reaped() as well.
>>>
>>> I don't see the mismatch with reality there (if you consider
>>> "dead_tples array" in the proper context, that is, the multiarray).
>>>
>>> What in particular do you find out of sync there?
>>
>> The current lazy_tid_reaped just finds a tid in a single tid array using
>> bsearch, but in your patch lazy_tid_reaped handles multiple tid arrays
>> and the processing becomes more complicated. So I thought it's better to
>> add a description to this function.
>
> Alright, updated with some more remarks that seemed relevant

Thank you for updating the patch.

The patch looks good to me. There is no review comment from me.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Feb 1, 2017 at 7:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> Thank you for updating the patch.
>>>>
>>>> Whole patch looks good to me except for the following one comment.
>>>> This is the final comment from me.
>>>>
>>>> /*
>>>>  *  lazy_tid_reaped() -- is a particular tid deletable?
>>>>  *
>>>>  *      This has the right signature to be an IndexBulkDeleteCallback.
>>>>  *
>>>>  *      Assumes dead_tuples array is in sorted order.
>>>>  */
>>>> static bool
>>>> lazy_tid_reaped(ItemPointer itemptr, void *state)
>>>> {
>>>>     LVRelStats *vacrelstats = (LVRelStats *) state;
>>>>
>>>> You might want to update the comment of lazy_tid_reaped() as well.
>>>
>>> I don't see the mismatch with reality there (if you consider
>>> "dead_tples array" in the proper context, that is, the multiarray).
>>>
>>> What in particular do you find out of sync there?
>>
>> The current lazy_tid_reaped just finds a tid in a single tid array using
>> bsearch, but in your patch lazy_tid_reaped handles multiple tid arrays
>> and the processing becomes more complicated. So I thought it's better to
>> add a description to this function.
>
> Alright, updated with some more remarks that seemed relevant

I just realized I never updated the early free patch after the
multiarray version.

So attached is a patch that frees the multiarray as early as possible
(just after finishing with index bulk deletes, right before doing
index cleanup and attempting truncation).

This should make the possibly big amount of memory available to other
processes for the duration of those tasks, which could be a long time
in some cases.
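
For the record, a minimal sketch of roughly what that early release looks
like (illustrative only, not the attached patch's code; field names follow
the multiarray structs quoted earlier in the thread):

/* Free each segment's TID array, then the segment descriptor array. */
static void
lazy_free_dead_tuples(DeadTuplesMultiArray *dt)
{
    int         i;

    for (i = 0; i < dt->num_segs; i++)
    {
        if (dt->dt_segments[i].dt_tids != NULL)
            pfree(dt->dt_segments[i].dt_tids);
    }
    pfree(dt->dt_segments);

    dt->dt_segments = NULL;
    dt->num_segs = 0;
    dt->num_entries = 0;
}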

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Andres Freund
Date:
Hi,

I've *not* read the history of this thread.  So I really might be
missing some context.


> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
> From: Claudio Freire <klaussfreire@gmail.com>
> Date: Mon, 12 Sep 2016 23:36:42 -0300
> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
> 
> Turn the dead_tuples array into a structure composed of several
> exponentially bigger arrays, to enable usage of more than 1GB
> of work mem during vacuum and thus reduce the number of full
> index scans necessary to remove all dead tids when the memory is
> available.

>   * We are willing to use at most maintenance_work_mem (or perhaps
>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
> - * initially allocate an array of TIDs of that size, with an upper limit that
> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>   * depends on table size (this limit ensures we don't allocate a huge area
> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
> - * we suspend the heap scan phase and perform a pass of index cleanup and page
> - * compaction, then resume the heap scan with an empty TID array.
> + * uselessly for vacuuming small tables). Additional arrays of increasingly
> + * large sizes are allocated as they become necessary.
> + *
> + * The TID array is thus represented as a list of multiple segments of
> + * varying size, beginning with the initial size of up to 128MB, and growing
> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
> + * is used up.

When the chunk size is 128MB, I'm a bit unconvinced that using
exponential growth is worth it. The allocator overhead can't be
meaningful in comparison to collecting 128MB dead tuples, the potential
waste is pretty big, and it increases memory fragmentation.


> + * Lookup in that structure proceeds sequentially in the list of segments,
> + * and with a binary search within each segment. Since segment's size grows
> + * exponentially, this retains O(N log N) lookup complexity.

N log N is a horrible lookup complexity.  That's the complexity of
*sorting* an entire array.  I think you might be trying to argue that
it's log(N) * log(N)? Once log(n) for the exponentially growing size of
segments, one for the binary search?

Afaics you could quite easily make it O(2 log(N)) by simply also doing
binary search over the segments.  Might not be worth it due to the small
constant involved normally.
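
(As a sketch of that alternative, binary-searching the per-segment upper
bounds before the in-segment search could look roughly like this; the
function is illustrative only and not part of the patch:)

/* Return the index of the first segment whose upper-bound TID is >= the
 * probe, or nsegs if the probe is past every segment.  Assumes segments
 * and their last_dead_tuple bounds are sorted ascending. */
static int
find_segment(ItemPointer itemptr, DeadTuplesSegment *segs, int nsegs)
{
    int         lo = 0;
    int         hi = nsegs;     /* search the half-open range [lo, hi) */

    while (lo < hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (ItemPointerCompare(&segs[mid].last_dead_tuple, itemptr) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}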


> + * If the array threatens to overflow, we suspend the heap scan phase and
> + * perform a pass of index cleanup and page compaction, then resume the heap
> + * scan with an array of logically empty but already preallocated TID segments
> + * to be refilled with more dead tuple TIDs.

Hm, it's not really the array that overflows, it's m_w_m that'd be
exceeded, right?


>  /*
> + * Minimum (starting) size of the dead_tuples array segments. Will allocate
> + * space for 128MB worth of tid pointers in the first segment, further segments
> + * will grow in size exponentially. Don't make it too small or the segment list
> + * will grow bigger than the sweetspot for search efficiency on big vacuums.
> + */
> +#define LAZY_MIN_TUPLES        Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))

That's not really the minimum, no? s/MIN/INIT/?


> +typedef struct DeadTuplesSegment
> +{
> +    int            num_dead_tuples;    /* # of entries in the segment */
> +    int            max_dead_tuples;    /* # of entries allocated in the segment */
> +    ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple (unset
> +                                         * until the segment is fully
> +                                         * populated) */
> +    unsigned short padding;
> +    ItemPointer dt_tids;    /* Array of dead tuples */
> +}    DeadTuplesSegment;

Whenever padding is needed, it should have an explanatory comment.  It's
certainly not obvious to me why it's needed here.


> @@ -1598,6 +1657,11 @@ lazy_vacuum_index(Relation indrel,
>      ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
>      ivinfo.strategy = vac_strategy;
>  
> +    /* Finalize the current segment by setting its upper bound dead tuple */
> +    seg = DeadTuplesCurrentSegment(vacrelstats);
> +    if (seg->num_dead_tuples > 0)
> +        seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];

Why don't we just maintain this here, for all of the segments?  Seems a
bit easier.


> @@ -1973,7 +2037,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
>  static void
>  lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>  {
> -    long        maxtuples;
> +    long        maxtuples,
> +                mintuples;
>      int            vac_work_mem = IsAutoVacuumWorkerProcess() &&
>      autovacuum_work_mem != -1 ?
>      autovacuum_work_mem : maintenance_work_mem;
> @@ -1982,7 +2047,6 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>      {
>          maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>          maxtuples = Min(maxtuples, INT_MAX);
> -        maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
>  
>          /* curious coding here to ensure the multiplication can't overflow */
>          if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
> @@ -1996,10 +2060,18 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>          maxtuples = MaxHeapTuplesPerPage;
>      }
>  
> -    vacrelstats->num_dead_tuples = 0;
> -    vacrelstats->max_dead_tuples = (int) maxtuples;
> -    vacrelstats->dead_tuples = (ItemPointer)
> -        palloc(maxtuples * sizeof(ItemPointerData));
> +    mintuples = Min(LAZY_MIN_TUPLES, maxtuples);
> +
> +    vacrelstats->dead_tuples.num_entries = 0;
> +    vacrelstats->dead_tuples.max_entries = (int) maxtuples;
> +    vacrelstats->dead_tuples.num_segs = 1;
> +    vacrelstats->dead_tuples.last_seg = 0;
> +    vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *)
> +        palloc(sizeof(DeadTuplesSegment));
> +    vacrelstats->dead_tuples.dt_segments[0].dt_tids = (ItemPointer)
> +        palloc(mintuples * sizeof(ItemPointerData));
> +    vacrelstats->dead_tuples.dt_segments[0].max_dead_tuples = mintuples;
> +    vacrelstats->dead_tuples.dt_segments[0].num_dead_tuples = 0;
>  }

Hm. Why don't we delay allocating dt_segments[0] till we actually need
it?  It's not uncommon for vacuums not to be able to find any dead
tuples, and it'd not change code in lazy_record_dead_tuple() much.


> @@ -2014,31 +2086,147 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats,
>       * could if we are given a really small maintenance_work_mem. In that
>       * case, just forget the last few tuples (we'll get 'em next time).
>       */
> -    if (vacrelstats->num_dead_tuples < vacrelstats->max_dead_tuples)
> +    if (vacrelstats->dead_tuples.num_entries < vacrelstats->dead_tuples.max_entries)
>      {
> -        vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr;
> -        vacrelstats->num_dead_tuples++;
> +        DeadTuplesSegment *seg = DeadTuplesCurrentSegment(vacrelstats);
> +
> +        if (seg->num_dead_tuples >= seg->max_dead_tuples)
> +        {
> +            /*
> +             * The segment is overflowing, so we must allocate a new segment.
> +             * We could have a preallocated segment descriptor already, in
> +             * which case we just reinitialize it, or we may need to repalloc
> +             * the vacrelstats->dead_tuples array. In that case, seg will no
> +             * longer be valid, so we must be careful about that. In any case,
> +             * we must update the last_dead_tuple copy in the overflowing
> +             * segment descriptor.
> +             */
> +            Assert(seg->num_dead_tuples == seg->max_dead_tuples);
> +            seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
> +            if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs)
> +            {
> +                int            new_num_segs = vacrelstats->dead_tuples.num_segs * 2;
> +
> +                vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc(
> +                               (void *) vacrelstats->dead_tuples.dt_segments,
> +                                   new_num_segs * sizeof(DeadTuplesSegment));

Might be worth breaking this into some sub-statements, it's quite hard
to read.


> +                while (vacrelstats->dead_tuples.num_segs < new_num_segs)
> +                {
> +                    /* Initialize as "unallocated" */
> +                    DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dt_segments[
> +                                         vacrelstats->dead_tuples.num_segs]);

dito.


> +/*
>   *    lazy_tid_reaped() -- is a particular tid deletable?
>   *
>   *        This has the right signature to be an IndexBulkDeleteCallback.
>   *
> - *        Assumes dead_tuples array is in sorted order.
> + *        Assumes the dead_tuples multiarray is in sorted order, both
> + *        the segment list and each segment itself, and that all segments'
> + *        last_dead_tuple fields up to date
>   */
>  static bool
>  lazy_tid_reaped(ItemPointer itemptr, void *state)

Have you done performance evaluation about potential performance
regressions in big indexes here?  IIRC this can be quite frequently
called?


I think this is reasonably close to commit, but unfortunately not quite
there yet. I.e. I personally won't polish this up & commit in the next
couple hours, but if somebody else wants to take that on...

Greetings,

Andres Freund



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> I've *not* read the history of this thread.  So I really might be
> missing some context.
>
>
>> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> From: Claudio Freire <klaussfreire@gmail.com>
>> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>>
>> Turn the dead_tuples array into a structure composed of several
>> exponentially bigger arrays, to enable usage of more than 1GB
>> of work mem during vacuum and thus reduce the number of full
>> index scans necessary to remove all dead tids when the memory is
>> available.
>
>>   * We are willing to use at most maintenance_work_mem (or perhaps
>>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> - * initially allocate an array of TIDs of that size, with an upper limit that
>> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>>   * depends on table size (this limit ensures we don't allocate a huge area
>> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
>> - * we suspend the heap scan phase and perform a pass of index cleanup and page
>> - * compaction, then resume the heap scan with an empty TID array.
>> + * uselessly for vacuuming small tables). Additional arrays of increasingly
>> + * large sizes are allocated as they become necessary.
>> + *
>> + * The TID array is thus represented as a list of multiple segments of
>> + * varying size, beginning with the initial size of up to 128MB, and growing
>> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
>> + * is used up.
>
> When the chunk size is 128MB, I'm a bit unconvinced that using
> exponential growth is worth it. The allocator overhead can't be
> meaningful in comparison to collecting 128MB dead tuples, the potential
> waste is pretty big, and it increases memory fragmentation.

The exponential strategy is mainly to improve lookup time (ie: to
avoid large segment lists).
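
As a rough worked example (assuming the 128MB initial segment and the
doubling described in the patch, i.e. ~22 million 6-byte TIDs per 128MB):
segments of 128MB, 256MB, 512MB, 1GB, 2GB and 4GB cover roughly 8GB of
maintenance_work_mem with only 6 segments, so the list scanned on lookup
stays very short.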

>> + * Lookup in that structure proceeds sequentially in the list of segments,
>> + * and with a binary search within each segment. Since segment's size grows
>> + * exponentially, this retains O(N log N) lookup complexity.
>
> N log N is a horrible lookup complexity.  That's the complexity of
> *sorting* an entire array.  I think you might be trying to argue that
> it's log(N) * log(N)? Once log(n) for the exponentially growing size of
> segments, one for the binary search?
>
> Afaics you could quite easily make it O(2 log(N)) by simply also doing
> binary search over the segments.  Might not be worth it due to the small
> constant involved normally.

It's a typo, yes, I meant O(log N) (which is equivalent to O(2 log N))

>> + * If the array threatens to overflow, we suspend the heap scan phase and
>> + * perform a pass of index cleanup and page compaction, then resume the heap
>> + * scan with an array of logically empty but already preallocated TID segments
>> + * to be refilled with more dead tuple TIDs.
>
> Hm, it's not really the array that overflows, it's m_w_m that'd be
> exceeded, right?

Yes, will rephrase. Although that's how the original comment expressed
the same concept.

>>  /*
>> + * Minimum (starting) size of the dead_tuples array segments. Will allocate
>> + * space for 128MB worth of tid pointers in the first segment, further segments
>> + * will grow in size exponentially. Don't make it too small or the segment list
>> + * will grow bigger than the sweetspot for search efficiency on big vacuums.
>> + */
>> +#define LAZY_MIN_TUPLES              Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
>
> That's not really the minimum, no? s/MIN/INIT/?

Ok

>> +typedef struct DeadTuplesSegment
>> +{
>> +     int                     num_dead_tuples;        /* # of entries in the segment */
>> +     int                     max_dead_tuples;        /* # of entries allocated in the segment */
>> +     ItemPointerData last_dead_tuple;        /* Copy of the last dead tuple (unset
>> +                                                                              * until the segment is fully
>> +                                                                              * populated) */
>> +     unsigned short padding;
>> +     ItemPointer dt_tids;    /* Array of dead tuples */
>> +}    DeadTuplesSegment;
>
> Whenever padding is needed, it should have an explanatory comment.  It's
> certainly not obvious to me why it's needed here.

Ok

>> @@ -1598,6 +1657,11 @@ lazy_vacuum_index(Relation indrel,
>>       ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
>>       ivinfo.strategy = vac_strategy;
>>
>> +     /* Finalize the current segment by setting its upper bound dead tuple */
>> +     seg = DeadTuplesCurrentSegment(vacrelstats);
>> +     if (seg->num_dead_tuples > 0)
>> +             seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
>
> Why don't we just maintain this here, for all of the segments?  Seems a
> bit easier.

Originally, I just wanted to maintain the validity of last_dead_tuple
as an invariant at all times. But it may be, as you say, that it's
simpler to just maintain the invariant of all segments at finalization
time. I'll explore that possibility.

>> @@ -1973,7 +2037,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
>>  static void
>>  lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>>  {
>> -     long            maxtuples;
>> +     long            maxtuples,
>> +                             mintuples;
>>       int                     vac_work_mem = IsAutoVacuumWorkerProcess() &&
>>       autovacuum_work_mem != -1 ?
>>       autovacuum_work_mem : maintenance_work_mem;
>> @@ -1982,7 +2047,6 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>>       {
>>               maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>>               maxtuples = Min(maxtuples, INT_MAX);
>> -             maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
>>
>>               /* curious coding here to ensure the multiplication can't overflow */
>>               if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
>> @@ -1996,10 +2060,18 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>>               maxtuples = MaxHeapTuplesPerPage;
>>       }
>>
>> -     vacrelstats->num_dead_tuples = 0;
>> -     vacrelstats->max_dead_tuples = (int) maxtuples;
>> -     vacrelstats->dead_tuples = (ItemPointer)
>> -             palloc(maxtuples * sizeof(ItemPointerData));
>> +     mintuples = Min(LAZY_MIN_TUPLES, maxtuples);
>> +
>> +     vacrelstats->dead_tuples.num_entries = 0;
>> +     vacrelstats->dead_tuples.max_entries = (int) maxtuples;
>> +     vacrelstats->dead_tuples.num_segs = 1;
>> +     vacrelstats->dead_tuples.last_seg = 0;
>> +     vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *)
>> +             palloc(sizeof(DeadTuplesSegment));
>> +     vacrelstats->dead_tuples.dt_segments[0].dt_tids = (ItemPointer)
>> +             palloc(mintuples * sizeof(ItemPointerData));
>> +     vacrelstats->dead_tuples.dt_segments[0].max_dead_tuples = mintuples;
>> +     vacrelstats->dead_tuples.dt_segments[0].num_dead_tuples = 0;
>>  }
>
> Hm. Why don't we delay allocating dt_segments[0] till we actually need
> it?  It's not uncommon for vacuums not to be able to find any dead
> tuples, and it'd not change code in lazy_record_dead_tuple() much.

I avoided that because that would make dt_segments[last_seg] invalid
for the case of a just-initialized multiarray.

Some places in the code use a macro that references
dt_segments[last_seg] (mostly for indexless tables), and having to
check num_segs and do lazy initialization would have complicated the
code considerably.

Nonetheless, I'll re-check how viable doing that would be.

>> @@ -2014,31 +2086,147 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats,
>>        * could if we are given a really small maintenance_work_mem. In that
>>        * case, just forget the last few tuples (we'll get 'em next time).
>>        */
>> -     if (vacrelstats->num_dead_tuples < vacrelstats->max_dead_tuples)
>> +     if (vacrelstats->dead_tuples.num_entries < vacrelstats->dead_tuples.max_entries)
>>       {
>> -             vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr;
>> -             vacrelstats->num_dead_tuples++;
>> +             DeadTuplesSegment *seg = DeadTuplesCurrentSegment(vacrelstats);
>> +
>> +             if (seg->num_dead_tuples >= seg->max_dead_tuples)
>> +             {
>> +                     /*
>> +                      * The segment is overflowing, so we must allocate a new segment.
>> +                      * We could have a preallocated segment descriptor already, in
>> +                      * which case we just reinitialize it, or we may need to repalloc
>> +                      * the vacrelstats->dead_tuples array. In that case, seg will no
>> +                      * longer be valid, so we must be careful about that. In any case,
>> +                      * we must update the last_dead_tuple copy in the overflowing
>> +                      * segment descriptor.
>> +                      */
>> +                     Assert(seg->num_dead_tuples == seg->max_dead_tuples);
>> +                     seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
>> +                     if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs)
>> +                     {
>> +                             int                     new_num_segs = vacrelstats->dead_tuples.num_segs * 2;
>> +
>> +                             vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc(
>> +                                                        (void *) vacrelstats->dead_tuples.dt_segments,
>> +                                                                new_num_segs * sizeof(DeadTuplesSegment));
>
> Might be worth breaking this into some sub-statements, it's quite hard
> to read.

Breaking what precisely? The comment?

>> +                             while (vacrelstats->dead_tuples.num_segs < new_num_segs)
>> +                             {
>> +                                     /* Initialize as "unallocated" */
>> +                                     DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dt_segments[
>> +                                                                              vacrelstats->dead_tuples.num_segs]);
>
> dito.

I don't really get what you're asking here.

>> +/*
>>   *   lazy_tid_reaped() -- is a particular tid deletable?
>>   *
>>   *           This has the right signature to be an IndexBulkDeleteCallback.
>>   *
>> - *           Assumes dead_tuples array is in sorted order.
>> + *           Assumes the dead_tuples multiarray is in sorted order, both
>> + *           the segment list and each segment itself, and that all segments'
>> + *           last_dead_tuple fields up to date
>>   */
>>  static bool
>>  lazy_tid_reaped(ItemPointer itemptr, void *state)
>
> Have you done performance evaluation about potential performance
> regressions in big indexes here?  IIRC this can be quite frequently
> called?

Yes, the benchmarks are upthread. The earlier runs were run on my
laptop and made little sense, so I'd ignore them as inaccurate. The
latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
time (ie: faster) of about 20%. Anastasia did another one[2] and saw
improvements as well, roughly 30%, though it's not measuring CPU time
but rather elapsed time.

Even small scales (100) saw an improvement as well, although possibly
below the noise floor. Tests are very slow so I haven't run enough to
measure variance and statistical significance.

I blame the improvement not only on better cache locality (the initial
search on the segment list usually fits in L1) but also on less
overall work due to needing fewer index scans, and the fact that
overall lookup complexity remains O(log N) due to the exponential
segment growth strategy.

[1] https://www.postgresql.org/message-id/CAGTBQpa6NFGO_6g_y_7zQx8L9GcHDSQKYdo1tGuh791z6PYgEg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/13bee467-bdcf-d3b9-c0ee-e2792fd46839%40postgrespro.ru

>
>
> I think this is reasonably close to commit, but unfortunately not quite
> there yet. I.e. I personally won't polish this up & commit in the next
> couple hours, but if somebody else wants to take that on...
>
> Greetings,
>
> Andres Freund

I'll post an updated patch with the requested changes shortly.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Apr 7, 2017 at 7:43 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> + * Lookup in that structure proceeds sequentially in the list of segments,
>>> + * and with a binary search within each segment. Since segment's size grows
>>> + * exponentially, this retains O(N log N) lookup complexity.
>>
>> N log N is a horrible lookup complexity.  That's the complexity of
>> *sorting* an entire array.  I think you might be trying to argue that
>> it's log(N) * log(N)? Once log(n) for the exponentially growing size of
>> segments, one for the binary search?
>>
>> Afaics you could quite easily make it O(2 log(N)) by simply also doing
>> binary search over the segments.  Might not be worth it due to the small
>> constant involved normally.
>
> It's a typo, yes, I meant O(log N) (which is equivalent to O(2 log N))


To clarify, lookup over the segments is linear, so it's O(M) with M
the number of segments, then the binary search is O(log N) with N the
number of dead tuples.

So lookup is O(M + log N), but M < log N because of the segments'
exponential growth, therefore the lookup is O(2 log N).
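
In code terms, a minimal sketch of that two-level lookup (hypothetical type
and function names, not the patch's actual code) could look like this:

#include "postgres.h"
#include "storage/itemptr.h"

/* One sorted run of dead-tuple TIDs, plus a copy of its largest TID. */
typedef struct Segment
{
    int             ntids;      /* entries used in tids[] */
    ItemPointerData last;       /* copy of tids[ntids - 1] */
    ItemPointer     tids;       /* sorted array of dead-tuple TIDs */
} Segment;

static int
cmp_itemptr(const void *a, const void *b)
{
    return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

/* O(M) scan over the segment bounds, then O(log N) bsearch inside the
 * first segment whose upper bound is >= the probe TID. */
static bool
tid_is_dead(ItemPointer itemptr, Segment *segs, int nsegs)
{
    int         i;

    for (i = 0; i < nsegs; i++)
    {
        if (ItemPointerCompare(itemptr, &segs[i].last) <= 0)
            return bsearch(itemptr, segs[i].tids, segs[i].ntids,
                           sizeof(ItemPointerData), cmp_itemptr) != NULL;
    }
    return false;
}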



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Andres Freund
Date:
Hi,


On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> > I've *not* read the history of this thread.  So I really might be
> > missing some context.
> >
> >
> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
> >> From: Claudio Freire <klaussfreire@gmail.com>
> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
> >>
> >> Turn the dead_tuples array into a structure composed of several
> >> exponentially bigger arrays, to enable usage of more than 1GB
> >> of work mem during vacuum and thus reduce the number of full
> >> index scans necessary to remove all dead tids when the memory is
> >> available.
> >
> >>   * We are willing to use at most maintenance_work_mem (or perhaps
> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
> >> - * initially allocate an array of TIDs of that size, with an upper limit that
> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
> >>   * depends on table size (this limit ensures we don't allocate a huge area
> >> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page
> >> - * compaction, then resume the heap scan with an empty TID array.
> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly
> >> + * large sizes are allocated as they become necessary.
> >> + *
> >> + * The TID array is thus represented as a list of multiple segments of
> >> + * varying size, beginning with the initial size of up to 128MB, and growing
> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
> >> + * is used up.
> >
> > When the chunk size is 128MB, I'm a bit unconvinced that using
> > exponential growth is worth it. The allocator overhead can't be
> > meaningful in comparison to collecting 128MB dead tuples, the potential
> > waste is pretty big, and it increases memory fragmentation.
> 
> The exponential strategy is mainly to improve lookup time (ie: to
> avoid large segment lists).

Well, if we were to do binary search on the segment list, that'd not be
necessary.

> >> +             if (seg->num_dead_tuples >= seg->max_dead_tuples)
> >> +             {
> >> +                     /*
> >> +                      * The segment is overflowing, so we must allocate a new segment.
> >> +                      * We could have a preallocated segment descriptor already, in
> >> +                      * which case we just reinitialize it, or we may need to repalloc
> >> +                      * the vacrelstats->dead_tuples array. In that case, seg will no
> >> +                      * longer be valid, so we must be careful about that. In any case,
> >> +                      * we must update the last_dead_tuple copy in the overflowing
> >> +                      * segment descriptor.
> >> +                      */
> >> +                     Assert(seg->num_dead_tuples == seg->max_dead_tuples);
> >> +                     seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
> >> +                     if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs)
> >> +                     {
> >> +                             int                     new_num_segs = vacrelstats->dead_tuples.num_segs * 2;
> >> +
> >> +                             vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc(
> >> +                                                        (void *) vacrelstats->dead_tuples.dt_segments,
> >> +                                                                new_num_segs * sizeof(DeadTuplesSegment));
> >
> > Might be worth breaking this into some sub-statements, it's quite hard
> > to read.
> 
> Breaking what precisely? The comment?

No, the three-line statement computing the new value of
dead_tuples.dt_segments.  I'd at least assign dead_tuples to a local
variable, to cut the length of the statement down.


> >> +                             while (vacrelstats->dead_tuples.num_segs < new_num_segs)
> >> +                             {
> >> +                                     /* Initialize as "unallocated" */
> >> +                                     DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dt_segments[
> >> +                                                                              vacrelstats->dead_tuples.num_segs]);
> >
> > dito.
> 
> I don't really get what you're asking here.

Trying to simplify/shorten the statement.


> >> +/*
> >>   *   lazy_tid_reaped() -- is a particular tid deletable?
> >>   *
> >>   *           This has the right signature to be an IndexBulkDeleteCallback.
> >>   *
> >> - *           Assumes dead_tuples array is in sorted order.
> >> + *           Assumes the dead_tuples multiarray is in sorted order, both
> >> + *           the segment list and each segment itself, and that all segments'
> >> + *           last_dead_tuple fields up to date
> >>   */
> >>  static bool
> >>  lazy_tid_reaped(ItemPointer itemptr, void *state)
> >
> > Have you done performance evaluation about potential performance
> > regressions in big indexes here?  IIRC this can be quite frequently
> > called?
> 
> Yes, the benchmarks are upthread. The earlier runs were run on my
> laptop and made little sense, so I'd ignore them as inaccurate. The
> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
> time (ie: faster) of about 20%. Anastasia did another one[2] and saw
> improvements as well, roughly 30%, though it's not measuring CPU time
> but rather elapsed time.

I'd be more concerned about cases that'd already fit into memory, not ones
where we avoid doing another scan - and I think you mostly measured that?

- Andres



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
>
> On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
>> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> > I've *not* read the history of this thread.  So I really might be
>> > missing some context.
>> >
>> >
>> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> >> From: Claudio Freire <klaussfreire@gmail.com>
>> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>> >>
>> >> Turn the dead_tuples array into a structure composed of several
>> >> exponentially bigger arrays, to enable usage of more than 1GB
>> >> of work mem during vacuum and thus reduce the number of full
>> >> index scans necessary to remove all dead tids when the memory is
>> >> available.
>> >
>> >>   * We are willing to use at most maintenance_work_mem (or perhaps
>> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> >> - * initially allocate an array of TIDs of that size, with an upper limit that
>> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>> >>   * depends on table size (this limit ensures we don't allocate a huge area
>> >> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
>> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page
>> >> - * compaction, then resume the heap scan with an empty TID array.
>> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly
>> >> + * large sizes are allocated as they become necessary.
>> >> + *
>> >> + * The TID array is thus represented as a list of multiple segments of
>> >> + * varying size, beginning with the initial size of up to 128MB, and growing
>> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
>> >> + * is used up.
>> >
>> > When the chunk size is 128MB, I'm a bit unconvinced that using
>> > exponential growth is worth it. The allocator overhead can't be
>> > meaningful in comparison to collecting 128MB dead tuples, the potential
>> > waste is pretty big, and it increases memory fragmentation.
>>
>> The exponential strategy is mainly to improve lookup time (ie: to
>> avoid large segment lists).
>
> Well, if we were to do binary search on the segment list, that'd not be
> necessary.

True, but the initial lookup might be slower in the end, since the
array would be bigger and cache locality worse.

Why do you say exponential growth fragments memory? AFAIK, all those
allocations are well beyond the point where malloc starts mmaping
memory, so each of those segments should be a mmap segment,
independently freeable.
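
(For reference, and assuming a glibc-based system: malloc's mmap threshold
defaults to 128kB and is only adjusted upward dynamically to a cap of 32MB
on 64-bit builds, so 128MB-and-larger segments would indeed each get their
own mmap'ed region and be unmapped individually when freed.)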

>> >> +             if (seg->num_dead_tuples >= seg->max_dead_tuples)
>> >> +             {
>> >> +                     /*
>> >> +                      * The segment is overflowing, so we must allocate a new segment.
>> >> +                      * We could have a preallocated segment descriptor already, in
>> >> +                      * which case we just reinitialize it, or we may need to repalloc
>> >> +                      * the vacrelstats->dead_tuples array. In that case, seg will no
>> >> +                      * longer be valid, so we must be careful about that. In any case,
>> >> +                      * we must update the last_dead_tuple copy in the overflowing
>> >> +                      * segment descriptor.
>> >> +                      */
>> >> +                     Assert(seg->num_dead_tuples == seg->max_dead_tuples);
>> >> +                     seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
>> >> +                     if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs)
>> >> +                     {
>> >> +                             int                     new_num_segs = vacrelstats->dead_tuples.num_segs * 2;
>> >> +
>> >> +                             vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc(
>> >> +                                                        (void *) vacrelstats->dead_tuples.dt_segments,
>> >> +                                                                new_num_segs * sizeof(DeadTuplesSegment));
>> >
>> > Might be worth breaking this into some sub-statements, it's quite hard
>> > to read.
>>
>> Breaking what precisely? The comment?
>
> No, the three-line statement computing the new value of
> dead_tuples.dt_segments.  I'd at least assign dead_tuples to a local
> variable, to cut the length of the statement down.

Ah, alright. Will try to do that.

>> >> +/*
>> >>   *   lazy_tid_reaped() -- is a particular tid deletable?
>> >>   *
>> >>   *           This has the right signature to be an IndexBulkDeleteCallback.
>> >>   *
>> >> - *           Assumes dead_tuples array is in sorted order.
>> >> + *           Assumes the dead_tuples multiarray is in sorted order, both
>> >> + *           the segment list and each segment itself, and that all segments'
>> >> + *           last_dead_tuple fields up to date
>> >>   */
>> >>  static bool
>> >>  lazy_tid_reaped(ItemPointer itemptr, void *state)
>> >
>> > Have you done performance evaluation about potential performance
>> > regressions in big indexes here?  IIRC this can be quite frequently
>> > called?
>>
>> Yes, the benchmarks are upthread. The earlier runs were run on my
>> laptop and made little sense, so I'd ignore them as inaccurate. The
>> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
>> time (ie: faster) of about 20%. Anastasia did another one[2] and saw
>> improvements as well, roughly 30%, though it's not measuring CPU time
>> but rather elapsed time.
>
> I'd be more concerned about cases that'd already fit into memory, not ones
> where we avoid doing another scan - and I think you mostly measured that?
>
> - Andres

Well, scale 400 is pretty much as big as you can get with the old 1GB
limit, and also suffered no significant regression. Although, true, it
didn't significantly improve either.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Andres Freund
Date:
On 2017-04-07 22:06:13 -0300, Claudio Freire wrote:
> On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> >
> > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
> >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > Hi,
> >> >
> >> > I've *not* read the history of this thread.  So I really might be
> >> > missing some context.
> >> >
> >> >
> >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
> >> >> From: Claudio Freire <klaussfreire@gmail.com>
> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
> >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
> >> >>
> >> >> Turn the dead_tuples array into a structure composed of several
> >> >> exponentially bigger arrays, to enable usage of more than 1GB
> >> >> of work mem during vacuum and thus reduce the number of full
> >> >> index scans necessary to remove all dead tids when the memory is
> >> >> available.
> >> >
> >> >>   * We are willing to use at most maintenance_work_mem (or perhaps
> >> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
> >> >> - * initially allocate an array of TIDs of that size, with an upper limit that
> >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
> >> >>   * depends on table size (this limit ensures we don't allocate a huge area
> >> >> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
> >> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page
> >> >> - * compaction, then resume the heap scan with an empty TID array.
> >> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly
> >> >> + * large sizes are allocated as they become necessary.
> >> >> + *
> >> >> + * The TID array is thus represented as a list of multiple segments of
> >> >> + * varying size, beginning with the initial size of up to 128MB, and growing
> >> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
> >> >> + * is used up.
> >> >
> >> > When the chunk size is 128MB, I'm a bit unconvinced that using
> >> > exponential growth is worth it. The allocator overhead can't be
> >> > meaningful in comparison to collecting 128MB dead tuples, the potential
> >> > waste is pretty big, and it increases memory fragmentation.
> >>
> >> The exponential strategy is mainly to improve lookup time (ie: to
> >> avoid large segment lists).
> >
> > Well, if we were to do binary search on the segment list, that'd not be
> > necessary.
> 
> True, but the initial lookup might be slower in the end, since the
> array would be bigger and cache locality worse.
> 
> Why do you say exponential growth fragments memory? AFAIK, all those
> allocations are well beyond the point where malloc starts mmaping
> memory, so each of those segments should be a mmap segment,
> independently freeable.

Not all platforms have that, and even on platforms with it, frequent,
unevenly sized, very large allocations can lead to enough fragmentation
that further allocations are harder and fragment / enlarge the
pagetable.


> >> Yes, the benchmarks are upthread. The earlier runs were run on my
> >> laptop and made little sense, so I'd ignore them as inaccurate. The
> >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
> >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw
> >> improvements as well, roughly 30%, though it's not measuring CPU time
> >> but rather elapsed time.
> >
> > I'd be more concerned about cases that'd already fit into memory, not ones
> > where we avoid doing another scan - and I think you mostly measured that?
> >
> > - Andres
> 
> Well, scale 400 is pretty much as big as you can get with the old 1GB
> limit, and also suffered no significant regression. Although, true, it
> didn't significantly improve either.

Aren't more interesting cases those where not that many dead tuples are
found, but the indexes are pretty large?  IIRC the index vacuum scans
still visit every leaf index tuple, no?

Greetings,

Andres Freund



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Apr 7, 2017 at 10:12 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-07 22:06:13 -0300, Claudio Freire wrote:
>> On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> >
>> > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
>> >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
>> >> > Hi,
>> >> >
>> >> > I've *not* read the history of this thread.  So I really might be
>> >> > missing some context.
>> >> >
>> >> >
>> >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> >> >> From: Claudio Freire <klaussfreire@gmail.com>
>> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>> >> >>
>> >> >> Turn the dead_tuples array into a structure composed of several
>> >> >> exponentially bigger arrays, to enable usage of more than 1GB
>> >> >> of work mem during vacuum and thus reduce the number of full
>> >> >> index scans necessary to remove all dead tids when the memory is
>> >> >> available.
>> >> >
>> >> >>   * We are willing to use at most maintenance_work_mem (or perhaps
>> >> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> >> >> - * initially allocate an array of TIDs of that size, with an upper limit that
>> >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>> >> >>   * depends on table size (this limit ensures we don't allocate a huge area
>> >> >> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
>> >> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page
>> >> >> - * compaction, then resume the heap scan with an empty TID array.
>> >> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly
>> >> >> + * large sizes are allocated as they become necessary.
>> >> >> + *
>> >> >> + * The TID array is thus represented as a list of multiple segments of
>> >> >> + * varying size, beginning with the initial size of up to 128MB, and growing
>> >> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
>> >> >> + * is used up.
>> >> >
>> >> > When the chunk size is 128MB, I'm a bit unconvinced that using
>> >> > exponential growth is worth it. The allocator overhead can't be
>> >> > meaningful in comparison to collecting 128MB dead tuples, the potential
>> >> > waste is pretty big, and it increases memory fragmentation.
>> >>
>> >> The exponential strategy is mainly to improve lookup time (ie: to
>> >> avoid large segment lists).
>> >
>> > Well, if we were to do binary search on the segment list, that'd not be
>> > necessary.
>>
>> True, but the initial lookup might be slower in the end, since the
>> array would be bigger and cache locality worse.
>>
>> Why do you say exponential growth fragments memory? AFAIK, all those
>> allocations are well beyond the point where malloc starts mmaping
>> memory, so each of those segments should be a mmap segment,
>> independently freeable.
>
> Not all platforms have that, and even on platforms with it, frequent,
> unevenly sized, very large allocations can lead to enough fragmentation
> that further allocations are harder and fragment / enlarge the
> pagetable.

I wouldn't call this frequent. You can get at most slightly more than
a dozen such allocations given the current limits.
And allocation sizes are quite regular - you get 128M or multiples of
128M, so each free block can be reused for N smaller allocations if
needed. I don't think it has much potential to fragment memory.

This isn't significantly different from tuplesort or any other code
that can do big allocations, and the differences favor less
fragmentation than those, so I don't see why this would need special
treatment.

My point is that it hasn't been simple to get this to the point where
it beats the original single binary search even in CPU time.
If we're to scrap this implementation and go for a double binary
search, I'd like to have a clear, measurable benefit to chase by doing
so. Fragmentation is hard to measure, and I cannot get CPU-bound
vacuums on the test hardware I have, so I can't test lookup
performance at big scales.

>> >> Yes, the benchmarks are upthread. The earlier runs were run on my
>> >> laptop and made little sense, so I'd ignore them as inaccurate. The
>> >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
>> >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw
>> >> improvements as well, roughly 30%, though it's not measuring CPU time
>> >> but rather elapsed time.
>> >
>> > I'd be more concerned about cases that'd already fit into memory, not ones
>> > where we avoid doing another scan - and I think you mostly measured that?
>> >
>> > - Andres
>>
>> Well, scale 400 is pretty much as big as you can get with the old 1GB
>> limit, and also suffered no significant regression. Although, true, it
>> didn't significantly improve either.
>
> Aren't more interesting cases those where not that many dead tuples are
> found, but the indexes are pretty large?  IIRC the index vacuum scans
> still visit every leaf index tuple, no?

Indeed they do, and that's what motivated this patch. But I'd need
TB-sized tables to set up something like that. I don't have the
hardware or time available to do that (vacuum on bloated TB-sized
tables can take days in my experience). Scale 4000 is as big as I can
get without running out of space for the tests in my test hardware.

If anybody else has the ability, I'd be thankful if they did test it
under those conditions, but I cannot. I think Anastasia's test is
closer to such a test, that's probably why it shows a bigger
improvement in total elapsed time.

Our production database could possibly be used, but it can take about
a week to clone it, upgrade it (it's 9.5 currently), and run the
relevant vacuum.

I did perform tests against the same pgbench databases referenced in
the post I linked earlier, but deleting only a fraction of the rows,
or on uncorrelated indexes. The benchmarks weren't very interesting,
and results were consistent with the linked benchmark (slight CPU time
improvement, just less impactful), so I didn't post them.

I think all those tests show that, if there's a workload that
regresses, it's a rare one, running on very powerful I/O hardware (to
make vacuum CPU-bound). And even if that were to happen, considering
that a single index scan (or fewer scans), even if slower, causes less
WAL traffic that has to be archived/streamed, it would still most
likely be a win overall.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Apr 7, 2017 at 10:06 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> >> +             if (seg->num_dead_tuples >= seg->max_dead_tuples)
>>> >> +             {
>>> >> +                     /*
>>> >> +                      * The segment is overflowing, so we must allocate a new segment.
>>> >> +                      * We could have a preallocated segment descriptor already, in
>>> >> +                      * which case we just reinitialize it, or we may need to repalloc
>>> >> +                      * the vacrelstats->dead_tuples array. In that case, seg will no
>>> >> +                      * longer be valid, so we must be careful about that. In any case,
>>> >> +                      * we must update the last_dead_tuple copy in the overflowing
>>> >> +                      * segment descriptor.
>>> >> +                      */
>>> >> +                     Assert(seg->num_dead_tuples == seg->max_dead_tuples);
>>> >> +                     seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
>>> >> +                     if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs)
>>> >> +                     {
>>> >> +                             int                     new_num_segs = vacrelstats->dead_tuples.num_segs * 2;
>>> >> +
>>> >> +                             vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc(
>>> >> +                                                        (void *) vacrelstats->dead_tuples.dt_segments,
>>> >> +                                                                new_num_segs * sizeof(DeadTuplesSegment));
>>> >
>>> > Might be worth breaking this into some sub-statements, it's quite hard
>>> > to read.
>>>
>>> Breaking what precisely? The comment?
>>
>> No, the three-line statement computing the new value of
>> dead_tuples.dt_segments.  I'd at least assign dead_tuples to a local
>> variable, to cut the length of the statement down.
>
> Ah, alright. Will try to do that.

Attached is an updated patch set with the requested changes.

Segment allocation still follows the exponential strategy, and segment
lookup is still linear.

I rebased the early free patch (patch 3) to apply on top of the v9
patch 2 (it needed some changes). I recognize the early free patch
didn't get nearly as much scrutiny, so I'm fine with committing only 2
if that one's ready to go but 3 isn't.

If it's decided to go for fixed 128M segments and a binary search of
segments, I don't think I can get that ready and tested before the
commitfest ends.


Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
David Steele
Date:
On 4/7/17 10:19 PM, Claudio Freire wrote:
> 
> I rebased the early free patch (patch 3) to apply on top of the v9
> patch 2 (it needed some changes). I recognize the early free patch
> didn't get nearly as much scrutiny, so I'm fine with committing only 2
> if that one's ready to go but 3 isn't.
> 
> If it's decided to go for fixed 128M segments and a binary search of
> segments, I don't think I can get that ready and tested before the
> commitfest ends.

This submission has been moved to CF 2017-07.

-- 
-David
david@pgmasters.net



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Fri, Apr 7, 2017 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote:
>> Why do you say exponential growth fragments memory? AFAIK, all those
>> allocations are well beyond the point where malloc starts mmaping
>> memory, so each of those segments should be a mmap segment,
>> independently freeable.
>
> Not all platforms have that, and even on platforms with it, frequent,
> unevenly sized, very large allocations can lead to enough fragmentation
> that further allocations are harder and fragment / enlarge the
> pagetable.

Such a thing is completely outside my personal experience.  I've never
heard of a case where a 64-bit platform fails to allocate memory
because something (what?) is fragmented.  Page table memory usage is a
concern at some level, but probably less so for autovacuum workers
than for most backends, because autovacuum workers (where most
vacuuming is done) exit after one pass through pg_class.  Although I
think our memory footprint is a topic that could use more energy, I
don't really see any reason to think that pagetable bloat caused by
unevenly sized allocations in short-lived processes is the place to
start worrying.

That having been said, IIRC, I did propose quite a ways upthread that
we use a fixed chunk size, just because it would use less actual
memory, never mind the size of the page table.  I mean, if you
allocate in chunks of 64MB, which I think is what I proposed, you'll
never waste more than 64MB.  If you allocate in
exponentially-increasing chunk sizes starting at 128MB, you could
easily waste much more.  Let's imagine a 1TB table where 20% of the
tuples are dead due to some large bulk operation (a bulk load failed,
or a bulk delete succeeded, or a bulk update happened).  Back of the
envelope calculation:

1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
maintenance_work_mem

So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
enough so we'll allocate another 8GB, for a total of 16256MB, but more
than three-quarters of that last allocation ends up being wasted.
I've been told on this list before that doubling is the one true way
of increasing the size of an allocated chunk of memory, but I'm still
a bit unconvinced.

On the other hand, if we did allocate fixed chunks of, say, 64MB, we
could end up with an awful lot of them.  For example, in the example
above, 9216MB/64MB = 144 chunks.  Is that number of mappings going to
make the VM system unhappy on any of the platforms we care about?  Is
that a bigger or smaller problem than what you (Andres) are worrying
about?  I don't know.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
> maintenance_work_mem
>
> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
> enough so we'll allocate another 8GB, for a total of 16256MB, but more
> than three-quarters of that last allocation ends up being wasted.
> I've been told on this list before that doubling is the one true way
> of increasing the size of an allocated chunk of memory, but I'm still
> a bit unconvinced.

There you're wrong. The allocation is capped to 1GB, so wastage has an
upper bound of 1GB.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Apr 11, 2017 at 3:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>> maintenance_work_mem
>>
>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>> than three-quarters of that last allocation ends up being wasted.
>> I've been told on this list before that doubling is the one true way
>> of increasing the size of an allocated chunk of memory, but I'm still
>> a bit unconvinced.
>
> There you're wrong. The allocation is capped to 1GB, so wastage has an
> upper bound of 1GB.

And total m_w_m for vacuum is still capped to 12GB (as big as you can get
with 32-bit integer indices).

So you can get at most 15 segments (a binary search is thus not worth
it), and overallocate by at most 1GB (the maximum segment size).

At least that's my rationale.
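
Just to spell the arithmetic out, a quick sketch (assuming 128MB initial
segments doubling up to a 1GB cap, and counting how many allocations a
full 12GB budget can produce; illustrative only, not patch code):

#include <stdio.h>

int
main(void)
{
    long        budget_mb = 12 * 1024;  /* current 12GB cap on vacuum m_w_m */
    long        seg_mb = 128;           /* initial segment size */
    long        reserved_mb = 0;
    int         nsegs = 0;

    while (reserved_mb < budget_mb)
    {
        reserved_mb += seg_mb;
        nsegs++;
        if (seg_mb < 1024)
            seg_mb *= 2;                /* exponential growth, capped at 1GB */
    }
    printf("%d segments, %ld MB reserved\n", nsegs, reserved_mb);
    /* prints "15 segments, 13184 MB reserved": at most 15 segments, and the
     * last (1GB) segment is where the up-to-1GB overallocation comes from */
    return 0;
}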

Removing the 12GB limit requires a bit of care (there are some 32-bit
counters still around I believe).



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Apr 11, 2017 at 2:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>> maintenance_work_mem
>>
>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>> than three-quarters of that last allocation ends up being wasted.
>> I've been told on this list before that doubling is the one true way
>> of increasing the size of an allocated chunk of memory, but I'm still
>> a bit unconvinced.
>
> There you're wrong. The allocation is capped to 1GB, so wastage has an
> upper bound of 1GB.

Ah, OK.  Sorry, didn't really look at the code.  I stand corrected,
but then it seems a bit strange to me that the largest and smallest
allocations are only 8x different.  I still don't really understand
what that buys us.  What would we lose if we just made 'em all 128MB?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Apr 11, 2017 at 4:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 2:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>>> maintenance_work_mem
>>>
>>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>>> than three-quarters of that last allocation ends up being wasted.
>>> I've been told on this list before that doubling is the one true way
>>> of increasing the size of an allocated chunk of memory, but I'm still
>>> a bit unconvinced.
>>
>> There you're wrong. The allocation is capped to 1GB, so wastage has an
>> upper bound of 1GB.
>
> Ah, OK.  Sorry, didn't really look at the code.  I stand corrected,
> but then it seems a bit strange to me that the largest and smallest
> allocations are only 8x different. I still don't really understand
> what that buys us.

Basically, it attacks the problem (which, I think, you mentioned) of
very small systems in which overallocation for small vacuums was an
issue.

The "slow start" behavior of starting with smaller segments tries to
improve the situation for small vacuums, not big ones.

By starting at 128M and growing up to 1GB, overallocation is bound to
the range 128M-1GB and is proportional to the amount of dead tuples,
not table size, as it was before. Starting at 128M helps the initial
segment search, but I could readily go for starting at 64M, I don't
think it would make a huge difference. Removing exponential growth,
however, would.

As the patch stands, small systems (say 32-bit systems) without
overcommit and with slowly-changing data can now set high m_w_m
without running into overallocation issues with autovacuum reserving
too much virtual space, as it will reserve memory only proportional to
the amount of dead tuples. Previously, it would reserve all of m_w_m
regardless of whether it was needed or not, with the only exception
being really small tables, so m_w_m=1GB was unworkable in those cases.
Now it should be fine.

> What would we lose if we just made 'em all 128MB?

TBH, not that much. We'd need 8x the compares to find the segment,
which forces a switch to a binary search over the segments, and that's
less cache-friendly. So it's more complex code and worse cache
locality. I'm just not sure what the benefit would be given the
current limits.

The only aim of this multiarray approach was making *virtual address
space reservations* proportional to the amount of actual memory
needed, as opposed to configured limits. It doesn't need to be a tight
fit, because calling palloc on its own doesn't actually use that
memory, at least on big allocations like these - the OS will not map
the memory pages until they're first touched. That's true in most
modern systems, and many ancient ones too.

In essence, the patch as it is proposed, doesn't *need* a binary
search, because the segment list can only grow up to 15 segments at
its biggest, and that's a size small enough that linear search will
outperform (or at least perform as well as) binary search. Reducing
the initial segment size wouldn't change that. If the 12GB limit is
lifted, or the maximum segment size reduced (from 1GB to 128MB for
example), however, that would change.
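
To make the shape of that lookup concrete, it's roughly this (a sketch
with simplified names, not the exact patch code; vac_cmp_itemptr is the
qsort-style TID comparator vacuumlazy.c already has, bsearch is from
<stdlib.h>):

static bool
dead_tuples_lookup(DeadTuplesMultiArray *dt, ItemPointer itemptr)
{
    int         i;

    /* linear scan over segment upper bounds: at most ~15 iterations */
    for (i = 0; i <= dt->last_seg; i++)
    {
        DeadTuplesSegment *seg = &dt->dt_segments[i];

        if (vac_cmp_itemptr(itemptr, &seg->last_dead_tuple) <= 0)
        {
            /* binary search within the one candidate segment */
            return bsearch(itemptr, seg->dt_tids,
                           seg->num_dead_tuples,
                           sizeof(ItemPointerData),
                           vac_cmp_itemptr) != NULL;
        }
    }
    return false;               /* past the last segment's last dead tuple */
}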

I'd be more in favor of lifting the 12GB limit than of reducing the
maximum segment size, for the reasons above. Raising the 12GB limit
has concrete and readily apparent benefits, whereas using bigger (or
smaller) segments is far more debatable. Yes, that will need a binary
search. But, I was hoping that could be a second (or third) patch, to
keep things simple, and benefits measurable.

Also, the plan as discussed in this very long thread, was to
eventually try to turn segments into bitmaps if dead tuple density was
big enough. That benefits considerably from big segments, since lookup
on a bitmap is O(1) - the bigger the segments, the faster the lookup,
as the search on the segment list would be dominant.
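
For reference, the bitmap form of a segment would make the in-segment
lookup O(1), along these lines (purely a sketch of the idea, none of this
is in the current patches):

/* A dense segment covering a contiguous block range, one bit per
 * possible line pointer (MaxHeapTuplesPerPage of them per block). */
typedef struct DeadTuplesBitmapSegment
{
    BlockNumber start_block;    /* first heap block covered */
    BlockNumber nblocks;        /* number of blocks covered */
    uint8      *bits;           /* nblocks * MaxHeapTuplesPerPage bits */
} DeadTuplesBitmapSegment;

static bool
bitmap_segment_lookup(DeadTuplesBitmapSegment *seg, ItemPointer itemptr)
{
    BlockNumber blk = ItemPointerGetBlockNumber(itemptr);
    OffsetNumber off = ItemPointerGetOffsetNumber(itemptr);
    uint64      bit;

    if (blk < seg->start_block || blk >= seg->start_block + seg->nblocks)
        return false;
    bit = (uint64) (blk - seg->start_block) * MaxHeapTuplesPerPage + (off - 1);
    return (seg->bits[bit / 8] & (1 << (bit % 8))) != 0;
}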

So... what shall we do?

At this point, I've given all my arguments for the current design. If
the more senior developers don't agree, I'll be happy to try your way.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> In essence, the patch as it is proposed, doesn't *need* a binary
> search, because the segment list can only grow up to 15 segments at
> its biggest, and that's a size small enough that linear search will
> outperform (or at least perform as well as) binary search. Reducing
> the initial segment size wouldn't change that. If the 12GB limit is
> lifted, or the maximum segment size reduced (from 1GB to 128MB for
> example), however, that would change.
>
> I'd be more in favor of lifting the 12GB limit than of reducing the
> maximum segment size, for the reasons above. Raising the 12GB limit
> has concrete and readily apparent benefits, whereas using bigger (or
> smaller) segments is far more debatable. Yes, that will need a binary
> search. But, I was hoping that could be a second (or third) patch, to
> keep things simple, and benefits measurable.

To me, it seems a bit short-sighted to say, OK, let's use a linear
search because there's this 12GB limit so we can limit ourselves to 15
segments.  Because somebody will want to remove that 12GB limit, and
then we'll have to revisit the whole thing anyway.  I think, anyway.

What's not clear to me is how sensitive the performance of vacuum is
to the number of cycles used here.  For a large index, the number of
searches will presumably be quite large, so it does seem worth
worrying about performance.  But if we just always used a binary
search, would that lose enough performance with small numbers of
segments that anyone would care?  If so, maybe we need to use linear
search for small numbers of segments and switch to binary search with
larger numbers of segments.
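
Something as simple as this dispatch, say (hand-waving, with made-up
helper names, just to illustrate what I mean):

/* Hypothetical: linear scan of the segment bounds when there are few
 * segments, binary search on them otherwise.  The threshold would have
 * to be measured; the two helpers are placeholders. */
#define SEGMENT_LINEAR_THRESHOLD    16

static int
find_segment(DeadTuplesMultiArray *dt, ItemPointer itemptr)
{
    if (dt->last_seg + 1 <= SEGMENT_LINEAR_THRESHOLD)
        return find_segment_linear(dt, itemptr);
    else
        return find_segment_binsrch(dt, itemptr);
}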

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Apr 12, 2017 at 4:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> In essence, the patch as it is proposed, doesn't *need* a binary
>> search, because the segment list can only grow up to 15 segments at
>> its biggest, and that's a size small enough that linear search will
>> outperform (or at least perform as well as) binary search. Reducing
>> the initial segment size wouldn't change that. If the 12GB limit is
>> lifted, or the maximum segment size reduced (from 1GB to 128MB for
>> example), however, that would change.
>>
>> I'd be more in favor of lifting the 12GB limit than of reducing the
>> maximum segment size, for the reasons above. Raising the 12GB limit
>> has concrete and readily apparent benefits, whereas using bigger (or
>> smaller) segments is far more debatable. Yes, that will need a binary
>> search. But, I was hoping that could be a second (or third) patch, to
>> keep things simple, and benefits measurable.
>
> To me, it seems a bit short-sighted to say, OK, let's use a linear
> search because there's this 12GB limit so we can limit ourselves to 15
> segments.  Because somebody will want to remove that 12GB limit, and
> then we'll have to revisit the whole thing anyway.  I think, anyway.

Ok, attached an updated patch that implements the binary search

> What's not clear to me is how sensitive the performance of vacuum is
> to the number of cycles used here.  For a large index, the number of
> searches will presumably be quite large, so it does seem worth
> worrying about performance.  But if we just always used a binary
> search, would that lose enough performance with small numbers of
> segments that anyone would care?  If so, maybe we need to use linear
> search for small numbers of segments and switch to binary search with
> larger numbers of segments.

I just went and tested.

I implemented the hybrid binary search attached, and ran a few tests
with and without the sequential code enabled, at small scales.

The difference is statistically significant, but small (less than 3%).
With proper optimization of the binary search, however, the difference
flips:

claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
fullbinary.s100.times
vacuum_bench_s100.1.shufp80.log:CPU: user: 6.20 s, system: 1.42 s,
elapsed: 18.34 s.
vacuum_bench_s100.2.shufp80.log:CPU: user: 6.44 s, system: 1.40 s,
elapsed: 19.75 s.
vacuum_bench_s100.3.shufp80.log:CPU: user: 6.28 s, system: 1.41 s,
elapsed: 18.48 s.
vacuum_bench_s100.4.shufp80.log:CPU: user: 6.39 s, system: 1.51 s,
elapsed: 20.60 s.
vacuum_bench_s100.5.shufp80.log:CPU: user: 6.26 s, system: 1.42 s,
elapsed: 19.16 s.

claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
hybridbinary.s100.times
vacuum_bench_s100.1.shufp80.log:CPU: user: 6.49 s, system: 1.39 s,
elapsed: 19.15 s.
vacuum_bench_s100.2.shufp80.log:CPU: user: 6.36 s, system: 1.33 s,
elapsed: 18.40 s.
vacuum_bench_s100.3.shufp80.log:CPU: user: 6.36 s, system: 1.31 s,
elapsed: 18.87 s.
vacuum_bench_s100.4.shufp80.log:CPU: user: 6.59 s, system: 1.35 s,
elapsed: 26.43 s.
vacuum_bench_s100.5.shufp80.log:CPU: user: 6.54 s, system: 1.28 s,
elapsed: 20.02 s.

That's after inlining the compare on both the linear and sequential
code, and it seems it lets the compiler optimize the binary search to
the point where it outperforms the sequential search.

That's not the case when the compare isn't inlined.

That seems in line with [1], which shows the impact of various
optimizations on both algorithms. It's clearly a close enough race
that optimizations play a huge role.

Since we're not likely to go and implement SSE2-optimized versions, I
believe I'll leave the binary search only. That's the attached patch
set.

I'm running the full test suite, but that takes a very long while.
I'll post the results when they're done.

[1] https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/


Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Thu, Apr 20, 2017 at 5:24 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> What's not clear to me is how sensitive the performance of vacuum is
>> to the number of cycles used here.  For a large index, the number of
>> searches will presumably be quite large, so it does seem worth
>> worrying about performance.  But if we just always used a binary
>> search, would that lose enough performance with small numbers of
>> segments that anyone would care?  If so, maybe we need to use linear
>> search for small numbers of segments and switch to binary search with
>> larger numbers of segments.
>
> I just went and tested.

Thanks!

> That's after inlining the compare on both the linear and sequential
> code, and it seems it lets the compiler optimize the binary search to
> the point where it outperforms the sequential search.
>
> That's not the case when the compare isn't inlined.
>
> That seems in line with [1], that show the impact of various
> optimizations on both algorithms. It's clearly a close enough race
> that optimizations play a huge role.
>
> Since we're not likely to go and implement SSE2-optimized versions, I
> believe I'll leave the binary search only. That's the attached patch
> set.

That sounds reasonable based on your test results.  I guess part of
what I was wondering is whether a vacuum on a table large enough to
require multiple gigabytes of work_mem isn't likely to be I/O-bound
anyway.  If so, a few cycles one way or the other other isn't likely
to matter much.  If not, where exactly are all of those CPU cycles
going?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Sun, Apr 23, 2017 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That's after inlining the compare on both the linear and sequential
>> code, and it seems it lets the compiler optimize the binary search to
>> the point where it outperforms the sequential search.
>>
>> That's not the case when the compare isn't inlined.
>>
>> That seems in line with [1], that show the impact of various
>> optimizations on both algorithms. It's clearly a close enough race
>> that optimizations play a huge role.
>>
>> Since we're not likely to go and implement SSE2-optimized versions, I
>> believe I'll leave the binary search only. That's the attached patch
>> set.
>
> That sounds reasonable based on your test results.  I guess part of
> what I was wondering is whether a vacuum on a table large enough to
> require multiple gigabytes of work_mem isn't likely to be I/O-bound
> anyway.  If so, a few cycles one way or the other other isn't likely
> to matter much.  If not, where exactly are all of those CPU cycles
> going?

I haven't been able to produce a table large enough to get a CPU-bound
vacuum, so such a case is likely to require huge storage and a very
powerful I/O system. Mine can only get about 100MB/s tops, and at that
speed, vacuum is I/O bound even for multi-GB work_mem. That's why I've
been using the reported CPU time as benchmark.

BTW, I left the benchmark script running all weekend at the office,
and when I got back a power outage had aborted it. In a few days I'll
be out on vacation, so I'm not sure I'll get the benchmark results
anytime soon. But this patch moved to 11.0, so I guess there's no rush.

Just FTR, in case I leave before the script is done, the script got to
scale 400 before the outage:

INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 40000000 row versions
DETAIL:  CPU: user: 5.94 s, system: 1.26 s, elapsed: 26.77 s.
INFO:  "pgbench_accounts": removed 40000000 row versions in 655739 pages
DETAIL:  CPU: user: 3.36 s, system: 2.57 s, elapsed: 61.67 s.
INFO:  index "pgbench_accounts_pkey" now contains 0 row versions in 109679 pages
DETAIL:  40000000 index row versions were removed.
109289 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.06 s.
INFO:  "pgbench_accounts": found 38925546 removable, 0 nonremovable
row versions in 655738 out of 655738 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1098
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 15.34 s, system: 6.95 s, elapsed: 126.21 s.
INFO:  "pgbench_accounts": truncated 655738 to 0 pages
DETAIL:  CPU: user: 0.22 s, system: 2.10 s, elapsed: 8.10 s.

In summary:

binsrch v10:

s100: CPU: user: 3.02 s, system: 1.51 s, elapsed: 16.43 s.
s400: CPU: user: 15.34 s, system: 6.95 s, elapsed: 126.21 s.

The old results:

Old Patched (sequential search):

s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.

Unpatched:

s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.

I wouldn't fret over the slight slowdown vs the old patch, it could be
noise (the script only completed a single run at scale 400).



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Mon, Apr 24, 2017 at 3:57 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I wouldn't fret over the slight slowdown vs the old patch, it could be
> noise (the script only completed a single run at scale 400).

Yeah, seems fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Masahiko Sawada
Date:
On Fri, Apr 21, 2017 at 6:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Apr 12, 2017 at 4:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> In essence, the patch as it is proposed, doesn't *need* a binary
>>> search, because the segment list can only grow up to 15 segments at
>>> its biggest, and that's a size small enough that linear search will
>>> outperform (or at least perform as well as) binary search. Reducing
>>> the initial segment size wouldn't change that. If the 12GB limit is
>>> lifted, or the maximum segment size reduced (from 1GB to 128MB for
>>> example), however, that would change.
>>>
>>> I'd be more in favor of lifting the 12GB limit than of reducing the
>>> maximum segment size, for the reasons above. Raising the 12GB limit
>>> has concrete and readily apparent benefits, whereas using bigger (or
>>> smaller) segments is far more debatable. Yes, that will need a binary
>>> search. But, I was hoping that could be a second (or third) patch, to
>>> keep things simple, and benefits measurable.
>>
>> To me, it seems a bit short-sighted to say, OK, let's use a linear
>> search because there's this 12GB limit so we can limit ourselves to 15
>> segments.  Because somebody will want to remove that 12GB limit, and
>> then we'll have to revisit the whole thing anyway.  I think, anyway.
>
> Ok, attached an updated patch that implements the binary search
>
>> What's not clear to me is how sensitive the performance of vacuum is
>> to the number of cycles used here.  For a large index, the number of
>> searches will presumably be quite large, so it does seem worth
>> worrying about performance.  But if we just always used a binary
>> search, would that lose enough performance with small numbers of
>> segments that anyone would care?  If so, maybe we need to use linear
>> search for small numbers of segments and switch to binary search with
>> larger numbers of segments.
>
> I just went and tested.
>
> I implemented the hybrid binary search attached, and ran a few tests
> with and without the sequential code enabled, at small scales.
>
> The difference is statistically significant, but small (less than 3%).
> With proper optimization of the binary search, however, the difference
> flips:
>
> claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
> fullbinary.s100.times
> vacuum_bench_s100.1.shufp80.log:CPU: user: 6.20 s, system: 1.42 s,
> elapsed: 18.34 s.
> vacuum_bench_s100.2.shufp80.log:CPU: user: 6.44 s, system: 1.40 s,
> elapsed: 19.75 s.
> vacuum_bench_s100.3.shufp80.log:CPU: user: 6.28 s, system: 1.41 s,
> elapsed: 18.48 s.
> vacuum_bench_s100.4.shufp80.log:CPU: user: 6.39 s, system: 1.51 s,
> elapsed: 20.60 s.
> vacuum_bench_s100.5.shufp80.log:CPU: user: 6.26 s, system: 1.42 s,
> elapsed: 19.16 s.
>
> claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
> hybridbinary.s100.times
> vacuum_bench_s100.1.shufp80.log:CPU: user: 6.49 s, system: 1.39 s,
> elapsed: 19.15 s.
> vacuum_bench_s100.2.shufp80.log:CPU: user: 6.36 s, system: 1.33 s,
> elapsed: 18.40 s.
> vacuum_bench_s100.3.shufp80.log:CPU: user: 6.36 s, system: 1.31 s,
> elapsed: 18.87 s.
> vacuum_bench_s100.4.shufp80.log:CPU: user: 6.59 s, system: 1.35 s,
> elapsed: 26.43 s.
> vacuum_bench_s100.5.shufp80.log:CPU: user: 6.54 s, system: 1.28 s,
> elapsed: 20.02 s.
>
> That's after inlining the compare on both the linear and sequential
> code, and it seems it lets the compiler optimize the binary search to
> the point where it outperforms the sequential search.
>
> That's not the case when the compare isn't inlined.
>
> That seems in line with [1], that show the impact of various
> optimizations on both algorithms. It's clearly a close enough race
> that optimizations play a huge role.
>
> Since we're not likely to go and implement SSE2-optimized versions, I
> believe I'll leave the binary search only. That's the attached patch
> set.
>
> I'm running the full test suite, but that takes a very long while.
> I'll post the results when they're done.
>
> [1] https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/

Thank you for updating the patch.

I've read this patch again and here are some review comments.

+ * Lookup in that structure proceeds sequentially in the list of segments,
+ * and with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity (2 log N to be
+ * precise).

IIUC we now do binary search even over the list of segments.

-----

We often fetch a particular dead tuple segment. How about providing a
macro for easier understanding?
For example,
#define GetDeadTuplsSegment(lvrelstats, seg) \ (&(lvrelstats)->dead_tuples.dt_segments[(seg)])

-----

+       if (vacrelstats->dead_tuples.num_segs == 0)
+               return;
+

+       /* If uninitialized, we have no tuples to delete from the indexes */
+       if (vacrelstats->dead_tuples.num_segs == 0)
+       {
+               return;
+       }

+       if (vacrelstats->dead_tuples.num_segs == 0)
+               return false;
+

As I listed, there is code to check if dead tuple is initialized
already in some places where doing actual vacuum.
I guess that it should not happen that we attempt to vacuum a
table/index page while not having any dead tuple. Is it better to have
Assert or ereport instead?

-----

@@ -1915,2 +2002,2 @@ count_nondeletable_pages(Relation onerel,
LVRelStats *vacrelstats)
-                       BlockNumber     prefetchStart;
-                       BlockNumber     pblkno;
+                       BlockNumber prefetchStart;
+                       BlockNumber pblkno;

I think that it's an unnecessary change.

-----

+       /* Search for the segment likely to contain the item pointer */
+       iseg = vac_itemptr_binsrch(
+               (void *) itemptr,
+               (void *)
&(vacrelstats->dead_tuples.dt_segments->last_dead_tuple),
+               vacrelstats->dead_tuples.last_seg + 1,
+               sizeof(DeadTuplesSegment));
+

I think that we can change the above to;

+       /* Search for the segment likely to contain the item pointer */
+       iseg = vac_itemptr_binsrch(
+               (void *) itemptr,
+               (void *) &(seg->last_dead_tuple),
+               vacrelstats->dead_tuples.last_seg + 1,
+               sizeof(DeadTuplesSegment));

We set "seg = vacrelstats->dead_tuples.dt_segments" just before this.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
Sorry for the delay, I had extended vacations that kept me away from
my test rigs, and afterward testing took, literally, a few weeks.

I built a more thorough test script that produced some interesting
results. Will attach the results.

For now, to the review comments:

On Thu, Apr 27, 2017 at 4:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've read this patch again and here are some review comments.
>
> + * Lookup in that structure proceeds sequentially in the list of segments,
> + * and with a binary search within each segment. Since segment's size grows
> + * exponentially, this retains O(log N) lookup complexity (2 log N to be
> + * precise).
>
> IIUC we now do binary search even over the list of segments.

Right

>
> -----
>
> We often fetch a particular dead tuple segment. How about providing a
> macro for easier understanding?
> For example,
>
>  #define GetDeadTuplsSegment(lvrelstats, seg) \
>   (&(lvrelstats)->dead_tuples.dt_segments[(seg)])
>
> -----
>
> +       if (vacrelstats->dead_tuples.num_segs == 0)
> +               return;
> +
>
> +       /* If uninitialized, we have no tuples to delete from the indexes */
> +       if (vacrelstats->dead_tuples.num_segs == 0)
> +       {
> +               return;
> +       }
>
> +       if (vacrelstats->dead_tuples.num_segs == 0)
> +               return false;
> +

Ok

> As I listed, there is code to check if dead tuple is initialized
> already in some places where doing actual vacuum.
> I guess that it should not happen that we attempt to vacuum a
> table/index page while not having any dead tuple. Is it better to have
> Assert or ereport instead?

I'm not sure. Having a non-empty dead tuples array is not necessary to
be able to honor the contract in the docstring. Most of those functions
clean up the heap/index of dead tuples given the array of dead tuples,
which is a no-op for an empty array.

The code that calls those functions doesn't bother calling if the array
is known empty, true, but there's no compelling reason to enforce that at the
interface. Doing so could cause subtle bugs rather than catch them
(in the form of unexpected assertion failures, if some caller forgot to
check the dead tuples array for emptiness).

If you're worried about the possibility that some bug fails to record
dead tuples in the array, and thus makes VACUUM silently ineffective,
I instead added a test for that case. This should be a better approach,
since it's more likely to catch unexpected failure modes than an assert.

> @@ -1915,2 +2002,2 @@ count_nondeletable_pages(Relation onerel,
> LVRelStats *vacrelstats)
> -                       BlockNumber     prefetchStart;
> -                       BlockNumber     pblkno;
> +                       BlockNumber prefetchStart;
> +                       BlockNumber pblkno;
>
> I think that it's an unnecessary change.

Yep. But funnily enough, that's how it is now in master.

>
> -----
>
> +       /* Search for the segment likely to contain the item pointer */
> +       iseg = vac_itemptr_binsrch(
> +               (void *) itemptr,
> +               (void *)
> &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple),
> +               vacrelstats->dead_tuples.last_seg + 1,
> +               sizeof(DeadTuplesSegment));
> +
>
> I think that we can change the above to;
>
> +       /* Search for the segment likely to contain the item pointer */
> +       iseg = vac_itemptr_binsrch(
> +               (void *) itemptr,
> +               (void *) &(seg->last_dead_tuple),
> +               vacrelstats->dead_tuples.last_seg + 1,
> +               sizeof(DeadTuplesSegment));
>
> We set "seg = vacrelstats->dead_tuples.dt_segments" just before this.

Right

Attached is a current version of both patches, rebased since we're at it.

I'm also attaching the output from the latest benchmark runs, in raw
(tar.bz2) and digested (bench_report) forms, the script used to run
them (vacuumbench.sh) and to produce the reports
(vacuum_bench_report.sh).

Those are before the changes in the review. While I don't expect any
change, I'll re-run some of them just in case, and try to investigate
the slowdown. But that will take forever. Each run takes about a week
on my test rig, and I don't have enough hardware to parallelize the
tests. I will run a test on a snapshot of a particularly troublesome
production database we have, that should be interesting.

The benchmarks show a consistent improvement at scale 400, which may
be related to the search implementation being better somehow, and a
slowdown at scale 4000 in some variants. I believe this is due to
those variants having highly clustered indexes. While the "shuf"
(shuffled) variants were intended to be the opposite of that, I
suspect I somehow failed to get the desired outcome, so I'll be
double-checking that.

In any case the slowdown is only materialized when vacuuming with a
large mwm setting, which is something that shouldn't happen
unintentionally.


Attachment

Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
Resending without the .tar.bz2 that got blocked


Sorry for the delay, I had extended vacations that kept me away from
my test rigs, and afterward testing took, literally, a few weeks.

I built a more thorough test script that produced some interesting
results. Will attach the results.

For now, to the review comments:

On Thu, Apr 27, 2017 at 4:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've read this patch again and here are some review comments.
>
> + * Lookup in that structure proceeds sequentially in the list of segments,
> + * and with a binary search within each segment. Since segment's size grows
> + * exponentially, this retains O(log N) lookup complexity (2 log N to be
> + * precise).
>
> IIUC we now do binary search even over the list of segments.

Right

>
> -----
>
> We often fetch a particular dead tuple segment. How about providing a
> macro for easier understanding?
> For example,
>
>  #define GetDeadTuplsSegment(lvrelstats, seg) \
>   (&(lvrelstats)->dead_tuples.dt_segments[(seg)])
>
> -----
>
> +       if (vacrelstats->dead_tuples.num_segs == 0)
> +               return;
> +
>
> +       /* If uninitialized, we have no tuples to delete from the indexes */
> +       if (vacrelstats->dead_tuples.num_segs == 0)
> +       {
> +               return;
> +       }
>
> +       if (vacrelstats->dead_tuples.num_segs == 0)
> +               return false;
> +

Ok

> As I listed, there is code to check if dead tuple is initialized
> already in some places where doing actual vacuum.
> I guess that it should not happen that we attempt to vacuum a
> table/index page while not having any dead tuple. Is it better to have
> Assert or ereport instead?

I'm not sure. Having a non-empty dead tuples array is not necessary to
be able to honor the contract in the docstring. Most of those functions
clean up the heap/index of dead tuples given the array of dead tuples,
which is a no-op for an empty array.

The code that calls those functions doesn't bother calling if the array
is known empty, true, but there's no compelling reason to enforce that at the
interface. Doing so could cause subtle bugs rather than catch them
(in the form of unexpected assertion failures, if some caller forgot to
check the dead tuples array for emptiness).

If you're worried about the possibility that some bug fails to record
dead tuples in the array, and thus makes VACUUM silently ineffective,
I instead added a test for that case. This should be a better approach,
since it's more likely to catch unexpected failure modes than an assert.

> @@ -1915,2 +2002,2 @@ count_nondeletable_pages(Relation onerel,
> LVRelStats *vacrelstats)
> -                       BlockNumber     prefetchStart;
> -                       BlockNumber     pblkno;
> +                       BlockNumber prefetchStart;
> +                       BlockNumber pblkno;
>
> I think that it's an unnecessary change.

Yep. But funnily enough, that's how it is now in master.

>
> -----
>
> +       /* Search for the segment likely to contain the item pointer */
> +       iseg = vac_itemptr_binsrch(
> +               (void *) itemptr,
> +               (void *)
> &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple),
> +               vacrelstats->dead_tuples.last_seg + 1,
> +               sizeof(DeadTuplesSegment));
> +
>
> I think that we can change the above to;
>
> +       /* Search for the segment likely to contain the item pointer */
> +       iseg = vac_itemptr_binsrch(
> +               (void *) itemptr,
> +               (void *) &(seg->last_dead_tuple),
> +               vacrelstats->dead_tuples.last_seg + 1,
> +               sizeof(DeadTuplesSegment));
>
> We set "seg = vacrelstats->dead_tuples.dt_segments" just before this.

Right

Attached is a current version of both patches, rebased since we're at it.

I'm also attaching the output from the latest benchmark runs, in raw
(tar.bz2) and digested (bench_report) forms, the script used to run
them (vacuumbench.sh) and to produce the reports
(vacuum_bench_report.sh).

Those are before the changes in the review. While I don't expect any
change, I'll re-run some of them just in case, and try to investigate
the slowdown. But that will take forever. Each run takes about a week
on my test rig, and I don't have enough hardware to parallelize the
tests. I will run a test on a snapshot of a particularly troublesome
production database we have, that should be interesting.

The benchmarks show a consistent improvement at scale 400, which may
be related to the search implementation being better somehow, and a
slowdown at scale 4000 in some variants. I believe this is due to
those variants having highly clustered indexes. While the "shuf"
(shuffled) variants were intended to be the opposite of that, I
suspect I somehow failed to get the desired outcome, so I'll be
double-checking that.

In any case the slowdown is only materialized when vacuuming with a
large mwm setting, which is something that shouldn't happen
unintentionally.


Attachment

Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alexey Chernyshov
Date:
Thank you for the patch and benchmark results, I have a couple remarks.
Firstly, padding in DeadTuplesSegment

typedef struct DeadTuplesSegment
{
    ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple (unset
                                         * until the segment is fully
                                         * populated). Keep it first to
                                         * simplify binary searches */
    unsigned short padding;             /* Align dt_tids to 32-bits,
                                         * sizeof(ItemPointerData) is aligned to
                                         * short, so add a padding short, to
                                         * make the size of DeadTuplesSegment a
                                         * multiple of 32-bits and align integer
                                         * components for better performance
                                         * during lookups into the multiarray */
    int             num_dead_tuples;    /* # of entries in the segment */
    int             max_dead_tuples;    /* # of entries allocated in the segment */
    ItemPointer     dt_tids;            /* Array of dead tuples */
}   DeadTuplesSegment;

In the comments to ItemPointerData it is written that it is 6 bytes long,
but it can be padded to 8 bytes by some compilers, so if we add padding the
current way, there is no guarantee that it will be done as expected. The
other way to do it is with pg_attribute_aligned. But in my opinion, there
is no need to do it manually, because the compiler will do this
optimization itself.
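
If the alignment really matters for lookup speed, I would rather make the
assumption explicit and checkable, for example (just a sketch):

/* Assert the intended layout instead of relying on a manual padding
 * member.  StaticAssertStmt needs statement context, so this would go
 * where the segments are first allocated (lazy_space_alloc, say);
 * pg_attribute_aligned() is another option, but it is only defined on
 * compilers that support it. */
StaticAssertStmt(sizeof(DeadTuplesSegment) % sizeof(int32) == 0,
                 "DeadTuplesSegment size is not a multiple of 32 bits");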

On 11.07.2017 19:51, Claudio Freire wrote:
>> -----
>>
>> +       /* Search for the segment likely to contain the item pointer */
>> +       iseg = vac_itemptr_binsrch(
>> +               (void *) itemptr,
>> +               (void *)
>> &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple),
>> +               vacrelstats->dead_tuples.last_seg + 1,
>> +               sizeof(DeadTuplesSegment));
>> +
>>
>> I think that we can change the above to;
>>
>> +       /* Search for the segment likely to contain the item pointer */
>> +       iseg = vac_itemptr_binsrch(
>> +               (void *) itemptr,
>> +               (void *) &(seg->last_dead_tuple),
>> +               vacrelstats->dead_tuples.last_seg + 1,
>> +               sizeof(DeadTuplesSegment));
>>
>> We set "seg = vacrelstats->dead_tuples.dt_segments" just before this.
> Right
In my mind, if you replaced vacrelstats->dead_tuples.last_seg + 1 with
GetNumDeadTuplesSegments(vacrelstats), it would be more meaningful.
Besides, you could replace the vac_itemptr_binsrch within the segment with
the stdlib bsearch, like:
    res = (ItemPointer) bsearch((void *) itemptr,
                                (void *) seg->dt_tids,
                                seg->num_dead_tuples,
                                sizeof(ItemPointerData),
                                vac_cmp_itemptr);
    return (res != NULL);

> Those are before the changes in the review. While I don't expect any
> change, I'll re-run some of them just in case, and try to investigate
> the slowdown. But that will take forever. Each run takes about a week
> on my test rig, and I don't have enough hardware to parallelize the
> tests. I will run a test on a snapshot of a particularly troublesome
> production database we have, that should be interesting.
Very interesting, waiting for the results.



Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov
<a.chernyshov@postgrespro.ru> wrote:
> Thank you for the patch and benchmark results, I have a couple remarks.
> Firstly, padding in DeadTuplesSegment
>
> typedef struct DeadTuplesSegment
>
> {
>
>     ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple
> (unset
>
>                                          * until the segment is fully
>
>                                          * populated). Keep it first to
> simplify
>
>                                          * binary searches */
>
>     unsigned short padding;        /* Align dt_tids to 32-bits,
>
>                                  * sizeof(ItemPointerData) is aligned to
>
>                                  * short, so add a padding short, to make
> the
>
>                                  * size of DeadTuplesSegment a multiple of
>
>                                  * 32-bits and align integer components for
>
>                                  * better performance during lookups into
> the
>
>                                  * multiarray */
>
>     int            num_dead_tuples;    /* # of entries in the segment */
>
>     int            max_dead_tuples;    /* # of entries allocated in the
> segment */
>
>     ItemPointer dt_tids;        /* Array of dead tuples */
>
> }    DeadTuplesSegment;
>
> In the comments to ItemPointerData is written that it is 6 bytes long, but
> can be padded to 8 bytes by some compilers, so if we add padding in a
> current way, there is no guaranty that it will be done as it is expected.
> The other way to do it with pg_attribute_alligned. But in my opinion, there
> is no need to do it manually, because the compiler will do this optimization
> itself.

I'll look into it. But my experience is that compilers won't align the
struct size like this, only individual attributes, and this attribute is
composed of 16-bit fields so it doesn't get aligned by default.

> On 11.07.2017 19:51, Claudio Freire wrote:
>>>
>>> -----
>>>
>>> +       /* Search for the segment likely to contain the item pointer */
>>> +       iseg = vac_itemptr_binsrch(
>>> +               (void *) itemptr,
>>> +               (void *)
>>> &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple),
>>> +               vacrelstats->dead_tuples.last_seg + 1,
>>> +               sizeof(DeadTuplesSegment));
>>> +
>>>
>>> I think that we can change the above to;
>>>
>>> +       /* Search for the segment likely to contain the item pointer */
>>> +       iseg = vac_itemptr_binsrch(
>>> +               (void *) itemptr,
>>> +               (void *) &(seg->last_dead_tuple),
>>> +               vacrelstats->dead_tuples.last_seg + 1,
>>> +               sizeof(DeadTuplesSegment));
>>>
>>> We set "seg = vacrelstats->dead_tuples.dt_segments" just before this.
>>
>> Right
>
> In my mind, if you change vacrelstats->dead_tuples.last_seg + 1 with
> GetNumDeadTuplesSegments(vacrelstats), it would be more meaningful.

It's not the same thing. On the first run it might be, but after a reset
of the multiarray, the number of segments is the allocated size, while
last_seg is the last one actually filled with data.
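
To illustrate the distinction, here is a minimal sketch of what a reset
could look like, assuming a multiarray struct holding the
DeadTuplesSegment array quoted above; the struct layout and function
name are illustrative, not the patch's exact code:

typedef struct DeadTuplesMultiArray
{
    int                num_segs;       /* segments allocated in dt_segments */
    int                last_seg;       /* last segment containing valid data */
    DeadTuplesSegment *dt_segments;    /* array of num_segs segments */
} DeadTuplesMultiArray;

static void
dead_tuples_reset(DeadTuplesMultiArray *arr)
{
    int         i;

    /* Keep the allocated segments for reuse, but mark them all empty. */
    for (i = 0; i <= arr->last_seg; i++)
        arr->dt_segments[i].num_dead_tuples = 0;
    arr->last_seg = 0;

    /*
     * num_segs is deliberately left alone: it tracks what is allocated,
     * while last_seg tracks what currently holds data, which is why the
     * two can differ after a reset.
     */
}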

> Besides, you can change the vac_itemptr_binsrch within the segment with
> stdlib bsearch, like:
>
>     res = (ItemPointer) bsearch((void *) itemptr,
>
>                                 (void *) seg->dt_tids,
>
>                                 seg->num_dead_tuples,
>
>                                 sizeof(ItemPointerData),
>
>                                 vac_cmp_itemptr);
>
>     return (res != NULL);

The stdlib's bsearch is considerably slower. The custom bsearch inlines
the comparison, which lets it hoist quite a bit of logic out of the loop
and, in general, generate far more specialized assembly.

For the compiler to optimize the stdlib's bsearch call, whole-program
optimization would have to be enabled, which is unlikely. Even then, it
may not be able to, due to aliasing rules.

This is what I came up with to make the new approach's performance on
par with or better than the old one, in CPU cycles. In fact, benchmarks
show that time spent on the CPU is lower now, in large part due to this.

It's also not the first custom binary search in postgres.
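
For reference, a minimal sketch of the kind of specialized search being
discussed, locating the segment by each segment's last_dead_tuple. It
assumes the usual PostgreSQL headers (storage/itemptr.h) and the
DeadTuplesSegment struct quoted above; names and details are
illustrative, not the patch's exact code:

/* Sketch only: assumes postgres.h and storage/itemptr.h are included. */
static inline int
itemptr_cmp(ItemPointer a, ItemPointer b)
{
    BlockNumber ablk = ItemPointerGetBlockNumber(a);
    BlockNumber bblk = ItemPointerGetBlockNumber(b);

    if (ablk != bblk)
        return (ablk < bblk) ? -1 : 1;
    return (int) ItemPointerGetOffsetNumber(a) -
           (int) ItemPointerGetOffsetNumber(b);
}

/* Return the index of the first segment whose last_dead_tuple >= itemptr */
static int
find_candidate_segment(ItemPointer itemptr, DeadTuplesSegment *segs, int nsegs)
{
    int         low = 0;
    int         high = nsegs;

    while (low < high)
    {
        int         mid = low + (high - low) / 2;

        if (itemptr_cmp(&segs[mid].last_dead_tuple, itemptr) < 0)
            low = mid + 1;
        else
            high = mid;
    }
    return low;                 /* nsegs if itemptr is past every segment */
}

Because the comparator is a static inline in the same translation unit,
the compiler can fold it into the loop, which is the point being made;
the library bsearch has to go through a function pointer on every
iteration.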



Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Jul 12, 2017 at 1:08 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov
> <a.chernyshov@postgrespro.ru> wrote:
>> Thank you for the patch and benchmark results, I have a couple remarks.
>> Firstly, padding in DeadTuplesSegment
>>
>> typedef struct DeadTuplesSegment
>>
>> {
>>
>>     ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple
>> (unset
>>
>>                                          * until the segment is fully
>>
>>                                          * populated). Keep it first to
>> simplify
>>
>>                                          * binary searches */
>>
>>     unsigned short padding;        /* Align dt_tids to 32-bits,
>>
>>                                  * sizeof(ItemPointerData) is aligned to
>>
>>                                  * short, so add a padding short, to make
>> the
>>
>>                                  * size of DeadTuplesSegment a multiple of
>>
>>                                  * 32-bits and align integer components for
>>
>>                                  * better performance during lookups into
>> the
>>
>>                                  * multiarray */
>>
>>     int            num_dead_tuples;    /* # of entries in the segment */
>>
>>     int            max_dead_tuples;    /* # of entries allocated in the
>> segment */
>>
>>     ItemPointer dt_tids;        /* Array of dead tuples */
>>
>> }    DeadTuplesSegment;
>>
>> In the comments to ItemPointerData is written that it is 6 bytes long, but
>> can be padded to 8 bytes by some compilers, so if we add padding in a
>> current way, there is no guaranty that it will be done as it is expected.
>> The other way to do it with pg_attribute_alligned. But in my opinion, there
>> is no need to do it manually, because the compiler will do this optimization
>> itself.
>
> I'll look into it. But my experience is that compilers won't align
> struct size like this, only attributes, and this attribute is composed
> of 16-bit attributes so it doesn't get aligned by default.

Checking sizeof(DeadTuplesSegment) suggests you were indeed right, at
least with GCC. I'll remove the padding.

It seems I just got the wrong impression at some point.
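
For what it's worth, a self-contained layout probe along these lines
shows it. The typedefs are stand-ins for the real PostgreSQL ones, just
to reproduce the layout; the output is for GCC on x86-64 and may differ
on other ABIs:

#include <stdio.h>
#include <stddef.h>

/* Stand-ins for the PostgreSQL types, only to probe struct layout */
typedef struct { unsigned short bi_hi, bi_lo; } BlockIdData;
typedef struct { BlockIdData ip_blkid; unsigned short ip_posid; } ItemPointerData;

typedef struct DeadTuplesSegment
{
    ItemPointerData  last_dead_tuple;   /* 6 bytes of shorts, no padding member */
    int              num_dead_tuples;
    int              max_dead_tuples;
    ItemPointerData *dt_tids;
} DeadTuplesSegment;

int
main(void)
{
    /*
     * Prints 8, 16 and 24 here: the compiler already inserts 2 bytes of
     * padding after the 6-byte ItemPointerData to align the following int,
     * so the explicit padding short is redundant.
     */
    printf("offsetof(num_dead_tuples) = %zu\n",
           offsetof(DeadTuplesSegment, num_dead_tuples));
    printf("offsetof(dt_tids) = %zu\n", offsetof(DeadTuplesSegment, dt_tids));
    printf("sizeof(DeadTuplesSegment) = %zu\n", sizeof(DeadTuplesSegment));
    return 0;
}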



Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Jul 12, 2017 at 1:29 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Jul 12, 2017 at 1:08 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov
>> <a.chernyshov@postgrespro.ru> wrote:
>>> Thank you for the patch and benchmark results, I have a couple remarks.
>>> Firstly, padding in DeadTuplesSegment
>>>
>>> typedef struct DeadTuplesSegment
>>>
>>> {
>>>
>>>     ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple
>>> (unset
>>>
>>>                                          * until the segment is fully
>>>
>>>                                          * populated). Keep it first to
>>> simplify
>>>
>>>                                          * binary searches */
>>>
>>>     unsigned short padding;        /* Align dt_tids to 32-bits,
>>>
>>>                                  * sizeof(ItemPointerData) is aligned to
>>>
>>>                                  * short, so add a padding short, to make
>>> the
>>>
>>>                                  * size of DeadTuplesSegment a multiple of
>>>
>>>                                  * 32-bits and align integer components for
>>>
>>>                                  * better performance during lookups into
>>> the
>>>
>>>                                  * multiarray */
>>>
>>>     int            num_dead_tuples;    /* # of entries in the segment */
>>>
>>>     int            max_dead_tuples;    /* # of entries allocated in the
>>> segment */
>>>
>>>     ItemPointer dt_tids;        /* Array of dead tuples */
>>>
>>> }    DeadTuplesSegment;
>>>
>>> In the comments to ItemPointerData is written that it is 6 bytes long, but
>>> can be padded to 8 bytes by some compilers, so if we add padding in a
>>> current way, there is no guaranty that it will be done as it is expected.
>>> The other way to do it with pg_attribute_alligned. But in my opinion, there
>>> is no need to do it manually, because the compiler will do this optimization
>>> itself.
>>
>> I'll look into it. But my experience is that compilers won't align
>> struct size like this, only attributes, and this attribute is composed
>> of 16-bit attributes so it doesn't get aligned by default.
>
> Doing sizeof(DeadTuplesSegment) suggests you were indeed right, at
> least in GCC. I'll remove the padding.
>
> Seems I just got the wrong impression at some point.

Updated versions of the patches attached.

A few runs of the benchmark show no significant difference, as it
should (being all cosmetic changes).

The bigger benchmark will take longer.


Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Indeed they do, and that's what motivated this patch. But I'd need
> TB-sized tables to set up something like that. I don't have the
> hardware or time available to do that (vacuum on bloated TB-sized
> tables can take days in my experience). Scale 4000 is as big as I can
> get without running out of space for the tests in my test hardware.
>
> If anybody else has the ability, I'd be thankful if they did test it
> under those conditions, but I cannot. I think Anastasia's test is
> closer to such a test, that's probably why it shows a bigger
> improvement in total elapsed time.
>
> Our production database could possibly be used, but it can take about
> a week to clone it, upgrade it (it's 9.5 currently), and run the
> relevant vacuum.

It looks like I won't be able to do that test with a production
snapshot anytime soon.

Getting approval for the budget required to do that looks like it's
going to take far longer than I thought.

Regardless of that, I think the patch can move forward. I'm still
planning to do the test at some point, but this patch shouldn't block
on it.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Daniel Gustafsson
Date:
> On 18 Aug 2017, at 13:39, Claudio Freire <klaussfreire@gmail.com> wrote:
>
> On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Indeed they do, and that's what motivated this patch. But I'd need
>> TB-sized tables to set up something like that. I don't have the
>> hardware or time available to do that (vacuum on bloated TB-sized
>> tables can take days in my experience). Scale 4000 is as big as I can
>> get without running out of space for the tests in my test hardware.
>>
>> If anybody else has the ability, I'd be thankful if they did test it
>> under those conditions, but I cannot. I think Anastasia's test is
>> closer to such a test, that's probably why it shows a bigger
>> improvement in total elapsed time.
>>
>> Our production database could possibly be used, but it can take about
>> a week to clone it, upgrade it (it's 9.5 currently), and run the
>> relevant vacuum.
>
> It looks like I won't be able to do that test with a production
> snapshot anytime soon.
>
> Getting approval for the budget required to do that looks like it's
> going to take far longer than I thought.
>
> Regardless of that, I think the patch can move forward. I'm still
> planning to do the test at some point, but this patch shouldn't block
> on it.

This patch has been marked Ready for committer after review, but wasn’t
committed in the current commitfest so it will be moved to the next.  Since it
no longer applies cleanly, it’s being reset to Waiting for author though.

cheers ./daniel


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Sun, Oct 1, 2017 at 8:36 PM, Daniel Gustafsson <daniel@yesql.se> wrote:
>> On 18 Aug 2017, at 13:39, Claudio Freire <klaussfreire@gmail.com> wrote:
>>
>> On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> Indeed they do, and that's what motivated this patch. But I'd need
>>> TB-sized tables to set up something like that. I don't have the
>>> hardware or time available to do that (vacuum on bloated TB-sized
>>> tables can take days in my experience). Scale 4000 is as big as I can
>>> get without running out of space for the tests in my test hardware.
>>>
>>> If anybody else has the ability, I'd be thankful if they did test it
>>> under those conditions, but I cannot. I think Anastasia's test is
>>> closer to such a test, that's probably why it shows a bigger
>>> improvement in total elapsed time.
>>>
>>> Our production database could possibly be used, but it can take about
>>> a week to clone it, upgrade it (it's 9.5 currently), and run the
>>> relevant vacuum.
>>
>> It looks like I won't be able to do that test with a production
>> snapshot anytime soon.
>>
>> Getting approval for the budget required to do that looks like it's
>> going to take far longer than I thought.
>>
>> Regardless of that, I think the patch can move forward. I'm still
>> planning to do the test at some point, but this patch shouldn't block
>> on it.
>
> This patch has been marked Ready for committer after review, but wasn’t
> committed in the current commitfest so it will be moved to the next.  Since it
> no longer applies cleanly, it’s being reset to Waiting for author though.
>
> cheers ./daniel

Rebased version of the patches attached


Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Michael Paquier
Date:
On Mon, Oct 2, 2017 at 11:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Rebased version of the patches attached

The status of the patch is misleading:
https://commitfest.postgresql.org/15/844/. This was marked as waiting
on author but a new version has been published. Let's be careful.

The last patches I am aware of, aka those from
https://www.postgresql.org/message-id/CAGTBQpZHTf2JtShC=ijc9wzEipo3XOKWQhx+8WiP7ZjPC3FBEg@mail.gmail.com,
do not apply. I am moving the patch to the next commit fest with a
waiting on author status, as this should be reviewed, but those need a
rebase.
-- 
Michael


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Nov 28, 2017 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Oct 2, 2017 at 11:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Rebased version of the patches attached
>
> The status of the patch is misleading:
> https://commitfest.postgresql.org/15/844/. This was marked as waiting
> on author but a new version has been published. Let's be careful.
>
> The last patches I am aware of, aka those from
> https://www.postgresql.org/message-id/CAGTBQpZHTf2JtShC=ijc9wzEipo3XOKWQhx+8WiP7ZjPC3FBEg@mail.gmail.com,
> do not apply. I am moving the patch to the next commit fest with a
> waiting on author status, as this should be reviewed, but those need a
> rebase.

They did apply at the time, but I think major work on vacuum has been
pushed since then, and I was also traveling, so I was out of reach.

It may take some time to rebase them again. Should I move it to needs
review myself after that?


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Michael Paquier
Date:
On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> They did apply at the time, but I think major work on vacuum was
> pushed since then, and also I was traveling so out of reach.
>
> It may take some time to rebase them again. Should I move to needs
> review myself after that?

Sure, if you can get into this state, please feel free to update the
status of the patch yourself.
-- 
Michael


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > They did apply at the time, but I think major work on vacuum was
> > pushed since then, and also I was traveling so out of reach.
> >
> > It may take some time to rebase them again. Should I move to needs
> > review myself after that?
>
> Sure, if you can get into this state, please feel free to update the
> status of the patch yourself.

We're now over a month since this status update. Claudio, for this to
have a chance of being included during this commitfest (which, personally,
I think would be great as it solves a pretty serious issue), we really
need to have it rebased and updated.  Once that's done, as Michael
says, please change the patch status back to 'Needs Review'.

Thanks!

Stephen

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:


On Sat, Jan 6, 2018 at 7:35 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > They did apply at the time, but I think major work on vacuum was
> > pushed since then, and also I was traveling so out of reach.
> >
> > It may take some time to rebase them again. Should I move to needs
> > review myself after that?
>
> Sure, if you can get into this state, please feel free to update the
> status of the patch yourself.

We're now over a month since this status update- Claudio, for this to
have a chance during this commitfest to be included (which, personally,
I think would be great as it solves a pretty serious issue..), we really
need to have it be rebased and updated.  Once that's done, as Michael
says, please change the patch status back to 'Needs Review'.

Sorry, had tons of other stuff that took priority.

I'll get to rebase this patch now.


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:


On Wed, Jan 17, 2018 at 5:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:


On Sat, Jan 6, 2018 at 7:35 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > They did apply at the time, but I think major work on vacuum was
> > pushed since then, and also I was traveling so out of reach.
> >
> > It may take some time to rebase them again. Should I move to needs
> > review myself after that?
>
> Sure, if you can get into this state, please feel free to update the
> status of the patch yourself.

We're now over a month since this status update- Claudio, for this to
have a chance during this commitfest to be included (which, personally,
I think would be great as it solves a pretty serious issue..), we really
need to have it be rebased and updated.  Once that's done, as Michael
says, please change the patch status back to 'Needs Review'.

Sorry, had tons of other stuff that took priority.

I'll get to rebase this patch now.



Huh. That was simpler than I thought.

Attached rebased versions.

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Aleksander Alekseev
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

I can confirm that these patches don't break anything; the code is well
documented, has some tests, and doesn't do anything obviously wrong. However,
I would recommend that someone more familiar with the VACUUM mechanism than I
am recheck these patches.

The new status of this patch is: Ready for Committer

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Thomas Munro
Date:
On Thu, Jan 18, 2018 at 9:17 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Huh. That was simpler than I thought.
>
> Attached rebased versions.

Hi Claudio,

FYI the regression test seems to have some run-to-run variation.
Though it usually succeeds, recently I have seen a couple of failures
like this:

========= Contents of ./src/test/regress/regression.diffs
*** /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/expected/vacuum.out
2018-01-24 01:41:28.200454371 +0000
--- /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/vacuum.out
2018-01-24 01:51:07.970049937 +0000
***************
*** 128,134 ****
  SELECT pg_relation_size('vactst', 'main');
   pg_relation_size
  ------------------
!                 0
  (1 row)

  SELECT count(*) FROM vactst;
--- 128,134 ----
  SELECT pg_relation_size('vactst', 'main');
   pg_relation_size
  ------------------
!              8192
  (1 row)

  SELECT count(*) FROM vactst;
======================================================================

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Jan 25, 2018 at 4:11 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Jan 18, 2018 at 9:17 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Huh. That was simpler than I thought.
>>
>> Attached rebased versions.
>
> Hi Claudio,
>
> FYI the regression test seems to have some run-to-run variation.
> Though it usually succeeds, recently I have seen a couple of failures
> like this:
>
> ========= Contents of ./src/test/regress/regression.diffs
> *** /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/expected/vacuum.out
> 2018-01-24 01:41:28.200454371 +0000
> --- /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/vacuum.out
> 2018-01-24 01:51:07.970049937 +0000
> ***************
> *** 128,134 ****
>   SELECT pg_relation_size('vactst', 'main');
>    pg_relation_size
>   ------------------
> !                 0
>   (1 row)
>
>   SELECT count(*) FROM vactst;
> --- 128,134 ----
>   SELECT pg_relation_size('vactst', 'main');
>    pg_relation_size
>   ------------------
> !              8192
>   (1 row)
>
>   SELECT count(*) FROM vactst;
> ======================================================================
>
> --
> Thomas Munro
> http://www.enterprisedb.com

I'll look into it

However, shouldn't an empty relation have an initial page anyway? In
that case shouldn't the correct value be 8192?


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Claudio Freire wrote:
> On Thu, Jan 25, 2018 at 4:11 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:

> > *** 128,134 ****
> >   SELECT pg_relation_size('vactst', 'main');
> >    pg_relation_size
> >   ------------------
> > !                 0
> >   (1 row)
> >
> >   SELECT count(*) FROM vactst;
> > --- 128,134 ----
> >   SELECT pg_relation_size('vactst', 'main');
> >    pg_relation_size
> >   ------------------
> > !              8192
> >   (1 row)

> However, shouldn't an empty relation have an initial page anyway? In
> that case shouldn't the correct value be 8192?

No, it's legal for an empty table to have size 0 on disk.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Jan 25, 2018 at 10:56 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Jan 25, 2018 at 4:11 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Thu, Jan 18, 2018 at 9:17 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> Huh. That was simpler than I thought.
>>>
>>> Attached rebased versions.
>>
>> Hi Claudio,
>>
>> FYI the regression test seems to have some run-to-run variation.
>> Though it usually succeeds, recently I have seen a couple of failures
>> like this:
>>
>> ========= Contents of ./src/test/regress/regression.diffs
>> *** /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/expected/vacuum.out
>> 2018-01-24 01:41:28.200454371 +0000
>> --- /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/vacuum.out
>> 2018-01-24 01:51:07.970049937 +0000
>> ***************
>> *** 128,134 ****
>>   SELECT pg_relation_size('vactst', 'main');
>>    pg_relation_size
>>   ------------------
>> !                 0
>>   (1 row)
>>
>>   SELECT count(*) FROM vactst;
>> --- 128,134 ----
>>   SELECT pg_relation_size('vactst', 'main');
>>    pg_relation_size
>>   ------------------
>> !              8192
>>   (1 row)
>>
>>   SELECT count(*) FROM vactst;
>> ======================================================================
>>
>> --
>> Thomas Munro
>> http://www.enterprisedb.com
>
> I'll look into it

I had the tests running in a loop all day long, and I cannot reproduce
that variance.

Can you share your steps to reproduce it, including configure flags?


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Thomas Munro
Date:
On Fri, Jan 26, 2018 at 9:38 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I had the tests running in a loop all day long, and I cannot reproduce
> that variance.
>
> Can you share your steps to reproduce it, including configure flags?

Here are two build logs where it failed:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/332968819
https://travis-ci.org/postgresql-cfbot/postgresql/builds/332592511

Here's one where it succeeded:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/333139855

The full build script used is:

./configure --enable-debug --enable-cassert --enable-coverage
--enable-tap-tests --with-tcl --with-python --with-perl --with-ldap
--with-icu && make -j4 all contrib docs && make -Otarget -j3
check-world

This is a virtualised 4 core system.  I wonder if "make -Otarget -j3
check-world" creates enough load on it to produce some weird timing
effect that you don't see on your development system.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Jan 25, 2018 at 6:21 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Jan 26, 2018 at 9:38 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I had the tests running in a loop all day long, and I cannot reproduce
>> that variance.
>>
>> Can you share your steps to reproduce it, including configure flags?
>
> Here are two build logs where it failed:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/332968819
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/332592511
>
> Here's one where it succeeded:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/333139855
>
> The full build script used is:
>
> ./configure --enable-debug --enable-cassert --enable-coverage
> --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap
> --with-icu && make -j4 all contrib docs && make -Otarget -j3
> check-world
>
> This is a virtualised 4 core system.  I wonder if "make -Otarget -j3
> check-world" creates enough load on it to produce some weird timing
> effect that you don't see on your development system.

I can't reproduce it, not even with the same build script.

It's starting to look like a timing effect indeed.

I get a similar effect if there's an active snapshot in another
session while vacuum runs. I don't know how the test suite ends up in
that situation, but it seems to be the case.

How do you suggest we go about fixing this? The test in question is
important, I've caught actual bugs in the implementation with it,
because it checks that vacuum effectively frees up space.

I'm thinking this vacuum test could be put on its own parallel group
perhaps? Since I can't reproduce it, I can't know whether that will
fix it, but it seems sensible.


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 2 Feb 2018 19:52:02 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in
<CAGTBQpaiNQSNJC8y4w82UBTaPsvSqRRg++yEi5wre1MFE2iD8Q@mail.gmail.com>
> On Thu, Jan 25, 2018 at 6:21 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > On Fri, Jan 26, 2018 at 9:38 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> >> I had the tests running in a loop all day long, and I cannot reproduce
> >> that variance.
> >>
> >> Can you share your steps to reproduce it, including configure flags?
> >
> > Here are two build logs where it failed:
> >
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/332968819
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/332592511
> >
> > Here's one where it succeeded:
> >
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/333139855
> >
> > The full build script used is:
> >
> > ./configure --enable-debug --enable-cassert --enable-coverage
> > --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap
> > --with-icu && make -j4 all contrib docs && make -Otarget -j3
> > check-world
> >
> > This is a virtualised 4 core system.  I wonder if "make -Otarget -j3
> > check-world" creates enough load on it to produce some weird timing
> > effect that you don't see on your development system.
> 
> I can't reproduce it, not even with the same build script.

I hit the same error with "make -j3 check-world", but only twice
in many trials.

> It's starting to look like a timing effect indeed.

It seems to be a skipped truncation, maybe caused by concurrent
autovacuum. See lazy_truncate_heap() for details. Updates of the
pg_stat_*_tables views can be delayed, so checking them can also fail.
Even though I haven't looked at the patch more closely, the "SELECT
pg_relation_size()" doesn't seem to give anything meaningful anyway.

> I get a similar effect if there's an active snapshot in another
> session while vacuum runs. I don't know how the test suite ends up in
> that situation, but it seems to be the case.
> 
> How do you suggest we go about fixing this? The test in question is
> important, I've caught actual bugs in the implementation with it,
> because it checks that vacuum effectively frees up space.
> 
> I'm thinking this vacuum test could be put on its own parallel group
> perhaps? Since I can't reproduce it, I can't know whether that will
> fix it, but it seems sensible.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> It's starting to look like a timing effect indeed.
>
> It seems to be truncation skip, maybe caused by concurrent
> autovacuum.

Good point, I'll also disable autovacuum on vactst.

> See lazy_truncate_heap() for details. Updates of
> pg_stat_*_tables can be delayed so looking it also can fail. Even
> though I haven't looked the patch closer, the "SELECT
> pg_relation_size()" doesn't seem to give something meaningful
> anyway.

Maybe then "explain (analyze, buffers, costs off, timing off, summary
off) select * from vactst" then.

The point is to check that the relation's heap has 0 pages.


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Feb 6, 2018 at 10:18 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> It's starting to look like a timing effect indeed.
>>
>> It seems to be truncation skip, maybe caused by concurrent
>> autovacuum.
>
> Good point, I'll also disable autovacuum on vactst.
>
>> See lazy_truncate_heap() for details. Updates of
>> pg_stat_*_tables can be delayed so looking it also can fail. Even
>> though I haven't looked the patch closer, the "SELECT
>> pg_relation_size()" doesn't seem to give something meaningful
>> anyway.
>
> Maybe then "explain (analyze, buffers, costs off, timing off, summary
> off) select * from vactst" then.
>
> The point is to check that the relation's heap has 0 pages.

Attached rebased patches with those changes mentioned above, namely:

- vacuum test now creates vactst with autovacuum disabled for it
- vacuum test on its own parallel group
- use explain analyze instead of pg_relation_size to check the
relation is properly truncated

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 6 Feb 2018 10:41:22 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in
<CAGTBQpaufC0o8OikWd8=5biXcbYjT51mPLfmy22NUjX6kUED0A@mail.gmail.com>
> On Tue, Feb 6, 2018 at 10:18 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>> It's starting to look like a timing effect indeed.
> >>
> >> It seems to be truncation skip, maybe caused by concurrent
> >> autovacuum.
> >
> > Good point, I'll also disable autovacuum on vactst.
> >
> >> See lazy_truncate_heap() for details. Updates of
> >> pg_stat_*_tables can be delayed so looking it also can fail. Even
> >> though I haven't looked the patch closer, the "SELECT
> >> pg_relation_size()" doesn't seem to give something meaningful
> >> anyway.
> >
> > Maybe then "explain (analyze, buffers, costs off, timing off, summary
> > off) select * from vactst" then.

Ah, sorry. I meant by the above that it gives unstable results with
autovacuum. So pg_relation_size() is usable after you turn off
autovacuum on the table.

> > The point is to check that the relation's heap has 0 pages.
> 
> Attached rebased patches with those changes mentioned above, namely:
> 
> - vacuum test now creates vactst with autovacuum disabled for it
> - vacuum test on its own parallel group
> - use explain analyze instead of pg_relation_size to check the
> relation is properly truncated

The problematic test was in the 0001..v14 patch. The new 0001..v15 is
identical to v14; instead, 0003-v8 has an additional part that edits
the test and expected output added by 0001 into the shape described
above. It seems that you merged the fixup onto the wrong commit.

And is it correct to assume that 0002 is intentionally missing from
this patchset?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Feb 7, 2018 at 12:50 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Tue, 6 Feb 2018 10:41:22 -0300, Claudio Freire <klaussfreire@gmail.com> wrote in
<CAGTBQpaufC0o8OikWd8=5biXcbYjT51mPLfmy22NUjX6kUED0A@mail.gmail.com>
>> On Tue, Feb 6, 2018 at 10:18 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> > On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI
>> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> >>> It's starting to look like a timing effect indeed.
>> >>
>> >> It seems to be truncation skip, maybe caused by concurrent
>> >> autovacuum.
>> >
>> > Good point, I'll also disable autovacuum on vactst.
>> >
>> >> See lazy_truncate_heap() for details. Updates of
>> >> pg_stat_*_tables can be delayed so looking it also can fail. Even
>> >> though I haven't looked the patch closer, the "SELECT
>> >> pg_relation_size()" doesn't seem to give something meaningful
>> >> anyway.
>> >
>> > Maybe then "explain (analyze, buffers, costs off, timing off, summary
>> > off) select * from vactst" then.
>
> Ah, sorry. I meant by the above that it gives unstable result
> with autovacuum. So pg_relation_size() is usable after you turned
> of autovacuum on the table.

You did mention stats could be delayed.

>> > The point is to check that the relation's heap has 0 pages.
>>
>> Attached rebased patches with those changes mentioned above, namely:
>>
>> - vacuum test now creates vactst with autovacuum disabled for it
>> - vacuum test on its own parallel group
>> - use explain analyze instead of pg_relation_size to check the
>> relation is properly truncated
>
> The problematic test was in the 0001..v14 patch. The new
> 0001..v15 is identical to v14 and instead 0003-v8 has additional
> part that edits the test and expects added by 0001 into the shape
> as above. It seems that you merged the fixup onto the wrong
> commit.
>
> And may we assume it correct that 0002 is missing in this
> patchset?

Sounds like I botched the rebase. Sorry about that.

Attached are corrected versions (1-v16 and 3-v9)

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Claudio Freire wrote:

> - vacuum test on its own parallel group

Hmm, this solution is not very friendly to the goal of reducing test
runtime, particularly since the new test creates a nontrivial-sized
table.  Maybe we can find a better alternative.  Can we use some wait
logic instead?  Maybe something like grab a snapshot of running VXIDs
and loop waiting until they're all gone before doing the vacuum?

Also, I don't understand why pg_relation_size() is not a better solution
to determining the table size compared to explain.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Feb 7, 2018 at 7:57 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Claudio Freire wrote:
>
>> - vacuum test on its own parallel group
>
> Hmm, this solution is not very friendly to the goal of reducing test
> runtime, particularly since the new test creates a nontrivial-sized
> table.  Maybe we can find a better alternative.  Can we use some wait
> logic instead?  Maybe something like grab a snapshot of running VXIDs
> and loop waiting until they're all gone before doing the vacuum?

I'm not sure there's any alternative. I did some tests and any active
snapshot on any other table, not necessarily on the one being
vacuumed, distorted the test. And it makes sense, since that snapshot
makes those deleted tuples unvacuumable.

Waiting as you say would be akin to what the patch does by putting
vacuum on its own parallel group.

I'm guessing all tests write something to the database, so all tests
will create a snapshot. Maybe if there were read-only tests, those
might be safe to include in vacuum's parallel group, but otherwise I
don't see any alternative.

> Also, I don't understand why pg_relation_size() is not a better solution
> to determining the table size compared to explain.

I was told pg_relation_size can return stale information. I didn't
check, I took that at face value.


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Claudio Freire wrote:
> On Wed, Feb 7, 2018 at 7:57 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Claudio Freire wrote:

> > Hmm, this solution is not very friendly to the goal of reducing test
> > runtime, particularly since the new test creates a nontrivial-sized
> > table.  Maybe we can find a better alternative.  Can we use some wait
> > logic instead?  Maybe something like grab a snapshot of running VXIDs
> > and loop waiting until they're all gone before doing the vacuum?
> 
> I'm not sure there's any alternative. I did some tests and any active
> snapshot on any other table, not necessarily on the one being
> vacuumed, distorted the test. And it makes sense, since that snapshot
> makes those deleted tuples unvacuumable.

Sure.

> Waiting as you say would be akin to what the patch does by putting
> vacuum on its own parallel group.

I don't think it's the same.  We don't need to wait until all the
concurrent tests are done -- we only need to wait until the transactions
that were current when the delete finished are done, which is very
different since each test runs tons of small transactions rather than
one single big transaction.

> > Also, I don't understand why pg_relation_size() is not a better solution
> > to determining the table size compared to explain.
> 
> I was told pg_relation_size can return stale information. I didn't
> check, I took that at face value.

Hmm, it uses stat() on the table files.  I think those files would be
truncated at the time the transaction commits, so they shouldn't be
stale.  (I don't think the system waits for a checkpoint to flush a
truncation.)  Maybe relying on that is not reliable or future-proof
enough.  Anyway this is a minor point -- the one above worries me most.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Waiting as you say would be akin to what the patch does by putting
>> vacuum on its own parallel group.
>
> I don't think it's the same.  We don't need to wait until all the
> concurrent tests are done -- we only need to wait until the transactions
> that were current when the delete finished are done, which is very
> different since each test runs tons of small transactions rather than
> one single big transaction.

Um... maybe "lock pg_class" ?

That should conflict with basically any other running transaction and
have pretty much that effect.

Attached is a version of patch 1 with that approach.

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Claudio Freire wrote:
> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >> Waiting as you say would be akin to what the patch does by putting
> >> vacuum on its own parallel group.
> >
> > I don't think it's the same.  We don't need to wait until all the
> > concurrent tests are done -- we only need to wait until the transactions
> > that were current when the delete finished are done, which is very
> > different since each test runs tons of small transactions rather than
> > one single big transaction.
> 
> Um... maybe "lock pg_class" ?

I was thinking in first doing 
  SELECT array_agg(DISTINCT virtualtransaction) vxids
    FROM pg_locks \gset

and then in a DO block loop until

   SELECT DISTINCT virtualtransaction
     FROM pg_locks
INTERSECT
   SELECT (unnest(:'vxids'::text[]));

returns empty; something along those lines.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Wed, Feb 7, 2018 at 11:29 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Claudio Freire wrote:
>> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> >> Waiting as you say would be akin to what the patch does by putting
>> >> vacuum on its own parallel group.
>> >
>> > I don't think it's the same.  We don't need to wait until all the
>> > concurrent tests are done -- we only need to wait until the transactions
>> > that were current when the delete finished are done, which is very
>> > different since each test runs tons of small transactions rather than
>> > one single big transaction.
>>
>> Um... maybe "lock pg_class" ?
>
> I was thinking in first doing
>   SELECT array_agg(DISTINCT virtualtransaction) vxids
>     FROM pg_locks \gset
>
> and then in a DO block loop until
>
>    SELECT DISTINCT virtualtransaction
>      FROM pg_locks
> INTERSECT
>    SELECT (unnest(:'vxids'::text[]));
>
> returns empty; something along those lines.

Isn't it the same though?

I can't think how a transaction wouldn't be holding at least an access
share on pg_class.


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Feb 8, 2018 at 12:13 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 11:29 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Claudio Freire wrote:
>>> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>>> >> Waiting as you say would be akin to what the patch does by putting
>>> >> vacuum on its own parallel group.
>>> >
>>> > I don't think it's the same.  We don't need to wait until all the
>>> > concurrent tests are done -- we only need to wait until the transactions
>>> > that were current when the delete finished are done, which is very
>>> > different since each test runs tons of small transactions rather than
>>> > one single big transaction.
>>>
>>> Um... maybe "lock pg_class" ?
>>
>> I was thinking in first doing
>>   SELECT array_agg(DISTINCT virtualtransaction) vxids
>>     FROM pg_locks \gset
>>
>> and then in a DO block loop until
>>
>>    SELECT DISTINCT virtualtransaction
>>      FROM pg_locks
>> INTERSECT
>>    SELECT (unnest(:'vxids'::text[]));
>>
>> returns empty; something along those lines.
>
> Isn't it the same though?
>
> I can't think how a transaction wouldn't be holding at least an access
> share on pg_class.

Never mind, I just saw the error of my ways.

I don't like looping, though; it seems overly cumbersome. What's worse:
maintaining that fragile, weird loop that might break (by causing
unexpected output), or a slight slowdown of the test suite? I don't
know how long it might take on slow machines, but on my machine, which
isn't a great machine, the vacuum test, while admittedly not fast, is
just a tiny fraction of what a simple "make check" takes. So it's not
a huge slowdown in any case.

I'll give it some thought, maybe there's a simpler way.


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Kyotaro HORIGUCHI
Date:
Hello, I looked this a bit closer.

Upthread[1], Robert mentioned the exponentially increasing size
of the additional segments.

>> Hmm, I had imagined making all of the segments the same size rather
>> than having the size grow exponentially.  The whole point of this is
>> to save memory, and even in the worst case you don't end up with that
>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>
> Wastage is bound by a fraction of the total required RAM, that is,
> it's proportional to the amount of required RAM, not the amount
> allocated. So it should still be fine, and the exponential strategy
> should improve lookup performance considerably.

It seems you are misreading him. (Anyway, I'm not sure what you meant
by the above; not-yet-allocated memory won't be a waste.) The final
number of dead tuples in a heap scan is undeterminable until the scan
ends. If a new dead tuple required a new, say, 512MB segment and the
scan ended just after, the wastage would be almost the whole segment.

On the other hand, I don't think the exponential strategy makes
things considerably better. bsearch iterations in lazy_tid_reaped()
are split between the segment search and the tid search. Intuitively,
the increased segment size more or less just moves some iterations
from the former to the latter.

I made a calculation[2]. With maintenance_work_mem of 4096MB, the
number of segments is 6 and the expected number of bsearch iterations
is about 20.8 for the exponential strategy. With 64MB fixed-size
segments, we would have 64 segments (that is not so many) and the
expected number of iterations is 20.4. (I suppose the increase comes
from the imbalanced sizes among segments.) In addition, as Robert
mentioned, the possible maximum memory wastage of the exponential
strategy is about 2GB, versus 64MB for the fixed-size strategy.

Seeing these numbers, I'm not inclined to take the exponential
strategy.


[1] https://www.postgresql.org/message-id/CAGTBQpbZX5S4QrnB6YP-2Nk+A9bxbaVktzKwsGvMeov3MTgdiQ@mail.gmail.com

[2] See attached perl script. I hope it is correct.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /usr/bin/perl

$maxmem=1024 * 4;
#=====
print "exponential sized strategy\n";
$ss = 64;
$ts = 0;
$sumiteritem = 0;
for ($i = 1 ; $ts < $maxmem ; $i++) {
    $ss = $ss * 2;
    if ($ts + $ss > $maxmem) {
        $ss = $maxmem - $ts;
    }
    $ts += $ss;
    $ntups = $ts*1024*1024 / 6;
    $ntupinseg = $ss*1024*1024 / 6;
    $npages = $ntups / 291;
    $tsize = $npages * 8192.0 / 1024 / 1024 / 1024;
    $sumiteritem += log($ntupinseg) * $ntupinseg; # weight by percentage in all tuples
    printf("#%d : segsize=%dMB total=%dMB, (tuples = %ld, min tsize=%.1fGB), iterseg(%d)=%f, iteritem(%d) = %f,
expectediter=%f\n",
 
           $i, $ss, $ts, $ntups, $tsize,
           $i, log($i), $ntupinseg, log($ntupinseg), log($i) + $sumiteritem/$ntups);
}

print "\n\nfixed sized strategy\n";
$ss = 64;
$ts = 0;
for ($i = 1 ; $ts < $maxmem ; $i++) {
    $ts += $ss;
    $ntups = $ts*1024*1024 / 6;
    $ntupinseg = $ss*1024*1024 / 6;
    $npages = $ntups / 300;
    $tsize = $npages * 8192.0 / 1024 / 1024 / 1024;
    printf("#%d : segsize=%dMB total=%dMB, (tuples = %ld, min tsize=%.1fGB), interseg(%d)=%f, iteritem(%d) = %f,
expectediter=%f\n",
 
           $i, $ss, $ts, $ntups, $tsize,
        $i, log($i), $ntupinseg, log($ntupinseg), log($i) + log($ntupinseg));
}



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Feb 8, 2018 at 2:44 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello, I looked this a bit closer.
>
> In upthread[1] Robert mentioned the exponentially increasing size
> of additional segments.
>
>>> Hmm, I had imagined making all of the segments the same size rather
>>> than having the size grow exponentially.  The whole point of this is
>>> to save memory, and even in the worst case you don't end up with that
>>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>>
>> Wastage is bound by a fraction of the total required RAM, that is,
>> it's proportional to the amount of required RAM, not the amount
>> allocated. So it should still be fine, and the exponential strategy
>> should improve lookup performance considerably.
>
> It seems that you are getting him wrong. (Anyway I'm not sure
> what you meant by the above. not-yet-allocated memory won't be a
> waste.) The conclusive number of dead tuples in a heap scan is
> undeteminable until the scan ends. If we had a new dead tuple
> required a, say 512MB new segment and the scan ends just after,
> the wastage will be almost the whole of the segment.

And the segment size is bound by a fraction of total needed memory.

When I said "allocated", I meant m_w_m. Wastage is not proportional to m_w_m.

> On the other hand, I don't think the exponential strategy make
> things considerably better. bsearch iterations in
> lazy_tid_reaped() are distributed between segment search and tid
> search. Intuitively more or less the increased segment size just
> moves some iterations of the former to the latter.
>
> I made a calculation[2]. With maintemance_work_mem of 4096MB, the
> number of segments is 6 and expected number of bsearch iteration
> is about 20.8 for the exponential strategy. With 64MB fixed size
> segments, we will have 64 segments (that is not so many) and the
> expected iteration is 20.4. (I suppose the increase comes from
> the imbalanced size among segments.) Addition to that, as Robert
> mentioned, the possible maximum memory wastage of the exponential
> strategy is about 2GB and 64MB in fixed size strategy.

That calculation has a slight bug in that it should use log2, and
segment size is limited to 1GB at the top end.

But in any case, the biggest issue is that it ignores the effect of
cache locality. The way the exponential strategy helps is by keeping
the segment list small, so it comfortably fits in fast cache memory,
while also keeping wastage at a minimum for small lists. 64MB segments
with a 4GB mwm would mean about 2kb of segment list. That fits in L1
if there's nothing else contending for it, but it's already starting
to get big, and settings larger than 4GB mwm can be expected.

I guess I could tune the starting/ending sizes a bit.

Say, with an upper limit of 256M (instead of 1G), and after fixing the
other issues, we get:

exponential sized strategy
...
#18 : segsize=64MB total=4096MB, (tuples = 715827882, min
tsize=18.8GB), iterseg(18)=4.169925, iteritem(11184810) = 23.415037,
expected iter=29.491213


fixed sized strategy
...
#64 : segsize=64MB total=4096MB, (tuples = 715827882, min
tsize=18.2GB), interseg(64)=6.000000, iteritem(11184810) = 23.415037,
expected iter=29.415037

Almost identical, and we get all the benefits of cache locality with
the exponential strategy. The fixed strategy might fit in the L1, but
it's less likely the bigger the mwm is.

The scaling factor could also be tuned I guess, but I'm wary of using
anything other than a doubling strategy, since it might cause memory
fragmentation.
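
As a rough illustration of the sizing under discussion, here is a small
standalone sketch of a doubling-with-cap strategy (64MB start, 256MB
cap, 4GB budget); the constants and helper are illustrative, not the
patch's actual allocation code:

#include <stdio.h>

/* Work in MB to keep the arithmetic simple. */
static unsigned
next_segment_mb(unsigned prev_mb, unsigned remaining_mb, unsigned cap_mb)
{
    unsigned    next = prev_mb * 2;     /* doubling strategy */

    if (next > cap_mb)
        next = cap_mb;                  /* cap the segment size */
    if (next > remaining_mb)
        next = remaining_mb;            /* never exceed the m_w_m budget */
    return next;
}

int
main(void)
{
    unsigned    budget_mb = 4096;       /* maintenance_work_mem */
    unsigned    seg_mb = 64;            /* illustrative starting size */
    unsigned    cap_mb = 256;           /* the cap floated above */
    int         n = 0;

    while (budget_mb > 0 && seg_mb > 0)
    {
        if (seg_mb > budget_mb)
            seg_mb = budget_mb;
        budget_mb -= seg_mb;
        printf("segment %d: %u MB (remaining budget %u MB)\n",
               ++n, seg_mb, budget_mb);
        seg_mb = next_segment_mb(seg_mb, budget_mb, cap_mb);
    }
    return 0;
}

With these numbers it ends at 18 segments with a final 64MB one, which
is consistent with the "#18 : segsize=64MB total=4096MB" line above.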


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Claudio Freire wrote:

> I don't like looping, though, seems overly cumbersome. What's worse?
> maintaining that fragile weird loop that might break (by causing
> unexpected output), or a slight slowdown of the test suite?
>
> I don't know how long it might take on slow machines, but in my
> machine, which isn't a great machine, while the vacuum test isn't fast
> indeed, it's just a tiny fraction of what a simple "make check" takes.
> So it's not a huge slowdown in any case.

Well, what about a machine running tests under valgrind, or the weird
cache-clobbering infuriatingly slow code?  Or buildfarm members running
on really slow hardware?  These days, a few people have spent a lot of
time trying to reduce the total test time, and it'd be bad to lose back
the improvements for no good reason.

I grant you that the looping I proposed is more complicated, but I don't
see any reason why it would break.

Another argument against the LOCK pg_class idea is that it causes an
unnecessary contention point across the whole parallel test group --
with possible weird side effects.  How about a deadlock?

Other than the wait loop I proposed, I think we can make a couple of
very simple improvements to this test case to avoid a slowdown:

1. the DELETE takes about 1/4th of the time and removes about the same
number of rows as the one using the IN clause:
  delete from vactst where random() < 3.0 / 4;

2. Use a new temp table rather than vactst.  Everything is then faster.

3. Figure out the minimum size for the table that triggers the behavior
   you want.  Right now you use 400k tuples -- maybe 100k are sufficient?
   Don't know.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Feb 8, 2018 at 8:39 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Claudio Freire wrote:
>
>> I don't like looping, though, seems overly cumbersome. What's worse?
>> maintaining that fragile weird loop that might break (by causing
>> unexpected output), or a slight slowdown of the test suite?
>>
>> I don't know how long it might take on slow machines, but in my
>> machine, which isn't a great machine, while the vacuum test isn't fast
>> indeed, it's just a tiny fraction of what a simple "make check" takes.
>> So it's not a huge slowdown in any case.
>
> Well, what about a machine running tests under valgrind, or the weird
> cache-clobbering infuriatingly slow code?  Or buildfarm members running
> on really slow hardware?  These days, a few people have spent a lot of
> time trying to reduce the total test time, and it'd be bad to lose back
> the improvements for no good reason.

It's not for no good reason. The old tests were woefully inadequate.

During the process of developing the patch, I got seriously broken
code that passed the tests nonetheless. The test as it was was very
ineffective at actually detecting issues.

This new test may be slow, but it's effective. That's a very good
reason to make it slower, if you ask me.

> I grant you that the looping I proposed is more complicated, but I don't
> see any reason why it would break.
>
> Another argument against the LOCK pg_class idea is that it causes an
> unnecessary contention point across the whole parallel test group --
> with possible weird side effects.  How about a deadlock?

The real issue with lock pg_class is that locks on pg_class are
short-lived, so I'm not waiting for whole transactions.

> Other than the wait loop I proposed, I think we can make a couple of
> very simple improvements to this test case to avoid a slowdown:
>
> 1. the DELETE takes about 1/4th of the time and removes about the same
> number of rows as the one using the IN clause:
>   delete from vactst where random() < 3.0 / 4;

I did try this at first, but it causes random output, so the test
breaks randomly.

> 2. Use a new temp table rather than vactst.  Everything is then faster.

I might try that.

> 3. Figure out the minimum size for the table that triggers the behavior
>    you want.  Right now you use 400k tuples -- maybe 100k are sufficient?
>    Don't know.

For that test, I need enough *dead* tuples to cause several passes.
Even small mwm settings require tons of tuples for this. In fact, I'm
thinking that number might be too low for its purpose, even. I'll
re-check, but I doubt it's too high. If anything, it's too low.
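
As a rough back-of-the-envelope check, assuming the pre-patch
representation of one 6-byte ItemPointerData per dead tuple and the
1MB maintenance_work_mem the test ends up using (per later discussion
in this thread):

#include <stdio.h>

int
main(void)
{
    long    mwm_bytes = 1L * 1024 * 1024;   /* maintenance_work_mem = 1MB */
    long    tid_bytes = 6;                  /* sizeof(ItemPointerData) */
    long    tids_per_pass = mwm_bytes / tid_bytes;
    long    dead_tuples = 400000;           /* the current test size */

    printf("%ld dead TIDs per pass -> %ld index pass(es) for %ld dead tuples\n",
           tids_per_pass,
           (dead_tuples + tids_per_pass - 1) / tids_per_pass,
           dead_tuples);                    /* 174762 per pass -> 3 passes */
    return 0;
}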


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Claudio Freire wrote:
> On Thu, Feb 8, 2018 at 8:39 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> During the process of developing the patch, I got seriously broken
> code that passed the tests nonetheless. The test as it was was very
> ineffective at actually detecting issues.
> 
> This new test may be slow, but it's effective. That's a very good
> reason to make it slower, if you ask me.

OK, I don't disagree with improving the test, but if we can make it fast
*and* effective, that's better than slow and effective.

> > Another argument against the LOCK pg_class idea is that it causes an
> > unnecessary contention point across the whole parallel test group --
> > with possible weird side effects.  How about a deadlock?
> 
> The real issue with lock pg_class is that locks on pg_class are
> short-lived, so I'm not waiting for whole transactions.

Doh.

> > Other than the wait loop I proposed, I think we can make a couple of
> > very simple improvements to this test case to avoid a slowdown:
> >
> > 1. the DELETE takes about 1/4th of the time and removes about the same
> > number of rows as the one using the IN clause:
> >   delete from vactst where random() < 3.0 / 4;
> 
> I did try this at first, but it causes random output, so the test
> breaks randomly.

OK.  Still, your query seqscans the table twice.  Maybe it's possible to
use a CTID scan to avoid that, but I'm not sure how.

> > 3. Figure out the minimum size for the table that triggers the behavior
> >    you want.  Right now you use 400k tuples -- maybe 100k are sufficient?
> >    Don't know.
> 
> For that test, I need enough *dead* tuples to cause several passes.
> Even small mwm settings require tons of tuples for this. In fact, I'm
> thinking that number might be too low for its purpose, even. I'll
> re-check, but I doubt it's too high. If anything, it's too low.

OK.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Feb 9, 2018 at 10:32 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Claudio Freire wrote:
>> On Thu, Feb 8, 2018 at 8:39 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
>> During the process of developing the patch, I got seriously broken
>> code that passed the tests nonetheless. The test as it was was very
>> ineffective at actually detecting issues.
>>
>> This new test may be slow, but it's effective. That's a very good
>> reason to make it slower, if you ask me.
>
> OK, I don't disagree with improving the test, but if we can make it fast
> *and* effective, that's better than slow and effective.

I'd love to have a test that uses multiple segments of dead tuples,
but for that, it needs to use more than 64MB of mwm. That amounts to,
basically, ~12M rows.

Is there a "slow test suite" where such a test could be added that
won't bother regular "make check"?

That, or we turn the initial segment size into a GUC, but I don't
think it's a useful GUC outside of the test suite.

>> > 3. Figure out the minimum size for the table that triggers the behavior
>> >    you want.  Right now you use 400k tuples -- maybe 100k are sufficient?
>> >    Don't know.
>>
>> For that test, I need enough *dead* tuples to cause several passes.
>> Even small mwm settings require tons of tuples for this. In fact, I'm
>> thinking that number might be too low for its purpose, even. I'll
>> re-check, but I doubt it's too high. If anything, it's too low.
>
> OK.

Turns out that it was a tad oversized. 300k tuples seems enough.

Attached is a new patch version that:

- Uses an unlogged table to make the large mwm test faster
- Uses a wait_barrier helper that waits for concurrent transactions
  to finish before vacuuming tables, to make sure deleted tuples
  actually are vacuumable
- Tweaks the size of the large mwm test to be as small as possible
- Optimizes the delete to avoid expensive operations yet attain
  the same end result

Attachment

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Aug 18, 2017 at 8:39 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Indeed they do, and that's what motivated this patch. But I'd need
>> TB-sized tables to set up something like that. I don't have the
>> hardware or time available to do that (vacuum on bloated TB-sized
>> tables can take days in my experience). Scale 4000 is as big as I can
>> get without running out of space for the tests in my test hardware.
>>
>> If anybody else has the ability, I'd be thankful if they did test it
>> under those conditions, but I cannot. I think Anastasia's test is
>> closer to such a test, that's probably why it shows a bigger
>> improvement in total elapsed time.
>>
>> Our production database could possibly be used, but it can take about
>> a week to clone it, upgrade it (it's 9.5 currently), and run the
>> relevant vacuum.
>
> It looks like I won't be able to do that test with a production
> snapshot anytime soon.
>
> Getting approval for the budget required to do that looks like it's
> going to take far longer than I thought.

I finally had a chance to test the patch in a production snapshot.

Actually, I tried to take out 2 birds with one stone, and I'm also
testing the FSM vacuum patch. It shouldn't significantly alter the
numbers anyway.

So, while the whole-db throttled vacuum (as is run in production) is
still ongoing, an interesting case already popped up.

TL;DR: without the patch, this particular table took more or less 16
1/2 hours to vacuum 313M dead tuples. With the patch, it took 6:10h to
vacuum 323M dead tuples. That's quite a speedup. It also used
significantly less CPU time.

Since vacuum here is throttled (with cost-based delays), this also
means it generated less I/O.

We have more extreme cases sometimes, so if I see something
interesting in what remains of the test, I'll post those results as
well.

The raw data:

Patched

INFO:  vacuuming "public.aggregated_tracks_hourly_full"
INFO:  scanned index "aggregated_tracks_hourly_full_pkey_null" to
remove 323778164 row versions
DETAIL:  CPU: user: 111.57 s, system: 31.28 s, elapsed: 2693.67 s
INFO:  scanned index "ix_aggregated_tracks_hourly_full_action_null" to
remove 323778164 row versions
DETAIL:  CPU: user: 281.89 s, system: 36.32 s, elapsed: 2915.94 s
INFO:  scanned index "ix_aggregated_tracks_hourly_full_nunq" to remove
323778164 row versions
DETAIL:  CPU: user: 313.35 s, system: 79.22 s, elapsed: 6400.87 s
INFO:  "aggregated_tracks_hourly_full": removed 323778164 row versions
in 7070739 pages
DETAIL:  CPU: user: 583.48 s, system: 69.77 s, elapsed: 8048.00 s
INFO:  index "aggregated_tracks_hourly_full_pkey_null" now contains
720807609 row versions in 10529903 pages
DETAIL:  43184641 index row versions were removed.
5288916 index pages have been deleted, 4696227 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.03 s.
INFO:  index "ix_aggregated_tracks_hourly_full_action_null" now
contains 720807609 row versions in 7635161 pages
DETAIL:  202678522 index row versions were removed.
4432789 index pages have been deleted, 3727966 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s.
INFO:  index "ix_aggregated_tracks_hourly_full_nunq" now contains
720807609 row versions in 15526885 pages
DETAIL:  202678522 index row versions were removed.
9052488 index pages have been deleted, 7390279 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s.
INFO:  "aggregated_tracks_hourly_full": found 41131260 removable,
209520861 nonremovable row versions in 7549244 out of 22391603 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 245834316
There were 260553451 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 1329.64 s, system: 244.22 s, elapsed: 22222.14 s.


Vanilla 9.5 (ie: what's in production right now, should be similar to master):

INFO:  vacuuming "public.aggregated_tracks_hourly_full"
INFO:  scanned index "aggregated_tracks_hourly_full_pkey_null" to
remove 178956729 row versions
DETAIL:  CPU 65.51s/253.67u sec elapsed 3490.13 sec.
INFO:  scanned index "ix_aggregated_tracks_hourly_full_action_null" to
remove 178956729 row versions
DETAIL:  CPU 63.26s/238.08u sec elapsed 3483.32 sec.
INFO:  scanned index "ix_aggregated_tracks_hourly_full_nunq" to remove
178956729 row versions
DETAIL:  CPU 340.15s/445.52u sec elapsed 15898.48 sec.
INFO:  "aggregated_tracks_hourly_full": removed 178956729 row versions
in 3121122 pages
DETAIL:  CPU 168.24s/159.20u sec elapsed 5678.51 sec.
INFO:  scanned index "aggregated_tracks_hourly_full_pkey_null" to
remove 134424729 row versions
DETAIL:  CPU 50.66s/265.19u sec elapsed 3977.15 sec.
INFO:  scanned index "ix_aggregated_tracks_hourly_full_action_null" to
remove 134424729 row versions
DETAIL:  CPU 99.68s/326.44u sec elapsed 6580.22 sec.
INFO:  scanned index "ix_aggregated_tracks_hourly_full_nunq" to remove
134424729 row versions
DETAIL:  CPU 146.96s/358.86u sec elapsed 10464.69 sec.
INFO:  "aggregated_tracks_hourly_full": removed 134424729 row versions
in 2072649 pages
DETAIL:  CPU 109.07s/37.12u sec elapsed 3601.39 sec.
INFO:  index "aggregated_tracks_hourly_full_pkey_null" now contains
870543969 row versions in 10529903 pages
DETAIL:  134424771 index row versions were removed.
4358027 index pages have been deleted, 3662385 are currently reusable.
CPU 0.02s/0.00u sec elapsed 2.42 sec.
INFO:  index "ix_aggregated_tracks_hourly_full_action_null" now
contains 870543969 row versions in 7635161 pages
DETAIL:  134424771 index row versions were removed.
3908583 index pages have been deleted, 3445049 are currently reusable.
CPU 0.02s/0.00u sec elapsed 0.08 sec.
INFO:  index "ix_aggregated_tracks_hourly_full_nunq" now contains
870543969 row versions in 15526885 pages
DETAIL:  218955943 index row versions were removed.
7710441 index pages have been deleted, 5928522 are currently reusable.
CPU 0.02s/0.01u sec elapsed 0.19 sec.
INFO:  "aggregated_tracks_hourly_full": found 134159696 removable,
90271560 nonremovable row versions in 6113375 out of 22391603 pages
DETAIL:  287 dead row versions cannot be removed yet.
There were 126680434 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 1191.42s/2223.19u sec elapsed 59885.50 sec.


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Feb 9, 2018 at 1:05 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Turns out that it was a tad oversized. 300k tuples seems enough.
>
> Attached is a new patch version that:
>
> - Uses an unlogged table to make the large mwm test faster
> - Uses a wait_barrier helper that waits for concurrent transactions
>   to finish before vacuuming tables, to make sure deleted tuples
>   actually are vacuumable
> - Tweaks the size of the large mwm test to be as small as possible
> - Optimizes the delete to avoid expensive operations yet attain
>   the same end result

Attached rebased versions of the patches (they weren't applying to
current master)

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Aleksander Alekseev
Date:
Hello everyone,

I would like to let you know that unfortunately these patches don't apply anymore. Also personally I'm a bit confused
by the last message that has 0001- and 0003- patches attached but not the 0002- one.

Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
I didn't receive your comment, I just saw it. Nevertheless, I rebased the patches a while ago just because I noticed
they didn't apply anymore in cputube, and they still seem to apply.

Patch number 2 was committed a long while ago, that's why it's missing. It was a simple patch, it landed rewritten as
commit 7e26e02eec90370dd222f35f00042f8188488ac4

Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I didn't receive your comment, I just saw it. Nevertheless, I rebased the patches a while ago just because I noticed
> they didn't apply anymore in cputube, and they still seem to apply.

Sorry, that is false.

They appear green in cputube, so I was confident they did apply, but I
just double-checked on a recent pull and they don't. I'll rebase them
shortly.


Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Tue, Apr 3, 2018 at 11:09 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I didn't receive your comment, I just saw it. Nevertheless, I rebased the patches a while ago just because I noticed
>> they didn't apply anymore in cputube, and they still seem to apply.
>
> Sorry, that is false.
>
> They appear green in cputube, so I was confident they did apply, but I
> just double-checked on a recent pull and they don't. I'll rebase them
> shortly.


Ok, rebased patches attached

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Heikki Linnakangas
Date:
On 03/04/18 17:20, Claudio Freire wrote:
> Ok, rebased patches attached

Thanks! I took a look at this.

First, now that the data structure is more complicated, I think it's 
time to abstract it, and move it out of vacuumlazy.c. The Tid Map needs 
to support the following operations:

* Add TIDs, in order (in 1st phase of vacuum)
* Random lookup, by TID (when scanning indexes)
* Iterate through all TIDs, in order (2nd pass over heap)

Let's add a new source file to hold the code for the tid map data 
structure, with functions corresponding those operations.

I took a stab at doing that, and I think it makes vacuumlazy.c nicer.
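
For illustration, a rough sketch of what such an interface could look
like; the function names here are hypothetical, not necessarily what
the attached patch uses:

#include "postgres.h"
#include "storage/itemptr.h"    /* ItemPointer, BlockNumber, OffsetNumber */

typedef struct TidMap TidMap;               /* opaque; backed by whatever structure we pick */
typedef struct TidMapIterator TidMapIterator;

/* 1st phase of vacuum: TIDs are appended in heap order */
extern void tidmap_add_tid(TidMap *map, ItemPointer tid);

/* index scan phase: is this heap TID among the dead tuples? */
extern bool tidmap_lookup(TidMap *map, ItemPointer tid);

/* 2nd pass over the heap: walk the dead TIDs in order, one block at a time */
extern TidMapIterator *tidmap_begin_iterate(TidMap *map);
extern bool tidmap_iterate_block(TidMapIterator *iter, BlockNumber *blkno,
                                 OffsetNumber **offsets, int *noffsets);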

Secondly, I'm not a big fan of the chosen data structure. I think the 
only reason that the segmented "multi-array" was chosen is that each 
"segment" works is similar to the simple array that we used to have. 
After putting it behind the abstraction, it seems rather ad hoc. There 
are many standard textbook data structures that we could use instead, 
and would be easier to understand and reason about, I think.

So, I came up with the attached patch. I used a B-tree as the data 
structure. Not sure it's the best one, I'm all ears for suggestions and 
bikeshedding on alternatives, but I'm pretty happy with that. I would 
expect it to be pretty close to the simple array with binary search in 
performance characteristics. It'd be pretty straightforward to optimize 
further, and e.g. use a bitmap of OffsetNumbers or some other more dense 
data structure in the B-tree leaf pages, but I resisted doing that as 
part of this patch.

I haven't done any performance testing of this (and not much testing in 
general), but at least the abstraction seems better this way. 
Performance testing would be good, too. In particular, I'd like to know 
how this might affect the performance of lazy_tid_reaped(). That's a hot 
spot when vacuuming indexes, so we don't want to add any cycles there. 
Was there any ready-made test kits on that in this thread? I didn't see 
any at a quick glance, but it's a long thread..

- Heikki

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Apr 5, 2018 at 5:02 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 03/04/18 17:20, Claudio Freire wrote:
>>
>> Ok, rebased patches attached
>
>
> Thanks! I took a look at this.
>
> First, now that the data structure is more complicated, I think it's time to
> abstract it, and move it out of vacuumlazy.c. The Tid Map needs to support
> the following operations:
>
> * Add TIDs, in order (in 1st phase of vacuum)
> * Random lookup, by TID (when scanning indexes)
> * Iterate through all TIDs, in order (2nd pass over heap)
>
> Let's add a new source file to hold the code for the tid map data structure,
> with functions corresponding those operations.
>
> I took a stab at doing that, and I think it makes vacuumlazy.c nicer.

About the refactoring to split this into its own set of files and
abstract away the underlying structure, I can totally get behind that.

The iteration interface, however, seems quite specific for the use
case of vacuumlazy, so it's not really a good abstraction. It also
copies stuff a lot, so it's quite heavyweight. I'd suggest trying to
go for a lighter weight interface with less overhead that is more
general at the same time.

If it was C++, I'd say build an iterator class.

C would do it probably with macros, so you can have a macro to get to
the current element, another to advance to the next element, and
another to check whether you've reached the end.
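
Something along these lines, as a minimal standalone sketch (all names
are hypothetical, and it assumes the segmented array with non-empty
segments):

#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    unsigned int   blkno;
    unsigned short offnum;
} DeadTid;                      /* stand-in for ItemPointerData */

typedef struct
{
    DeadTid    *tids;           /* sorted TIDs in this segment */
    size_t      ntids;
} TidSegment;

typedef struct
{
    TidSegment *segs;
    int         nsegs;
    int         cur_seg;        /* segment we're currently walking */
    size_t      cur_item;       /* position within that segment */
} TidIterator;

/* current element */
#define TIDITER_CURRENT(it)  (&(it)->segs[(it)->cur_seg].tids[(it)->cur_item])

/* true once every segment has been exhausted */
#define TIDITER_DONE(it)     ((it)->cur_seg >= (it)->nsegs)

/* advance to the next TID, stepping into the next segment when needed */
#define TIDITER_ADVANCE(it) \
    do { \
        if (++(it)->cur_item >= (it)->segs[(it)->cur_seg].ntids) \
        { \
            (it)->cur_item = 0; \
            (it)->cur_seg++; \
        } \
    } while (0)

The caller would then just loop with something like
for (...; !TIDITER_DONE(&it); TIDITER_ADVANCE(&it)) and decide for
itself how to store or consume each TID.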

I can do that if we agree on the points below:

> Secondly, I'm not a big fan of the chosen data structure. I think the only
> reason that the segmented "multi-array" was chosen is that each "segment"
> works similarly to the simple array that we used to have. After putting it
> behind the abstraction, it seems rather ad hoc. There are many standard
> textbook data structures that we could use instead, and would be easier to
> understand and reason about, I think.
>
> So, I came up with the attached patch. I used a B-tree as the data
> structure. Not sure it's the best one, I'm all ears for suggestions and
> bikeshedding on alternatives, but I'm pretty happy with that. I would expect
> it to be pretty close to the simple array with binary search in performance
> characteristics. It'd be pretty straightforward to optimize further, and
> e.g. use a bitmap of OffsetNumbers or some other more dense data structure
> in the B-tree leaf pages, but I resisted doing that as part of this patch.

About the B-tree, however, I don't think a B-tree is a good idea.

Trees' main benefit is that they can be inserted to efficiently. When
all your data is loaded sequentially, in-order, in-memory and
immutable, the tree is pointless, more costly to build, and harder to
maintain - in terms of code complexity.

In this use case, the only benefit of B-trees would be that they're
optimized for disk access. If we planned to store this on-disk,
perhaps I'd grant you that. But we don't plan to do that, and it's not
even clear doing it would be efficient enough for the intended use.

On the other side, using B-trees incurs memory overhead due to the
need for internal nodes, can fragment memory because internal nodes
aren't the same size as leaf nodes, is easier to get wrong and
introduce bugs... I don't see a gain. If you propose its use, at least
benchmark it to show some gain.

So I don't think B-tree is a good idea, the sorted array already is
good enough, and if not, it's at least close to the earlier
implementation and less likely to introduce bugs.

Furthermore, among the 200-ish messages this thread has accumulated,
better ideas have been proposed, better because they do use less
memory and are faster (like using bitmaps when possible), but if we
can't push a simple refactoring first, there's no chance a bigger
rewrite will fare better. Remember, in this use case, using less
memory far outweighs any other consideration. Less memory directly
translates to fewer iterations over the indexes, because more can be
crammed into m_w_m, which is a huge time saving. Far more than any
micro-optimization.

About 2 years ago, I chose to try to push this simple algorithm first,
then try to improve on it with better data structures. Nobody
complained at the time (I think, IIRC), and I don't think it's fair to
go and revisit that now. It just delays getting a solution for this
issue in pursuit of "the perfect implementation" that might never
arrive. Or even if it does, there's nothing stopping us from pushing
another patch in the future with that better implementation if we
wish. Let's get something simple and proven first.

> I haven't done any performance testing of this (and not much testing in
> general), but at least the abstraction seems better this way. Performance
> testing would be good, too. In particular, I'd like to know how this might
> affect the performance of lazy_tid_reaped(). That's a hot spot when
> vacuuming indexes, so we don't want to add any cycles there. Was there any
> ready-made test kits on that in this thread? I didn't see any at a quick
> glance, but it's a long thread..

If you dig old messages in the thread, I had attached the scripts I
used for benchmarking this.

I'm attaching again one version of them (I've been modifying it to
suit my purposes at each review round), you'll probably want to tweak
it to build test cases good for your purpose here.

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Heikki Linnakangas
Date:
On 06/04/18 01:59, Claudio Freire wrote:
> The iteration interface, however, seems quite specific for the use
> case of vacuumlazy, so it's not really a good abstraction.

Can you elaborate? It does return the items one block at a time. Is that 
what you mean by being specific for vacuumlazy? I guess that's a bit 
special, but if you imagine some other users for this abstraction, it's 
probably not that unusual. For example, if we started using it in bitmap 
heap scans, a bitmap heap scan would also want to get the TIDs one block 
number at a time.

> It also copies stuff a lot, so it's quite heavyweight. I'd suggest
> trying to go for a lighter weight interface with less overhead that
> is more general at the same time.

Note that there was similar copying, to construct an array of 
OffsetNumbers, happening in lazy_vacuum_page() before this patch. So the 
net amount of copying is the same.

I'm envisioning that this data structure will sooner or later be 
optimized further, so that when you have a lot of TIDs pointing to the 
same block, we would pack them more tightly, storing the block number 
just once, with an array of offset numbers. This interface that returns 
an array of offset numbers matches that future well, as the iterator 
could just return a pointer to the array of offset numbers, with no 
copying. (If we end up doing something even more dense, like a bitmap, 
then it doesn't help, but that's ok too.)

> About the B-tree, however, I don't think a B-tree is a good idea.
> 
> Trees' main benefit is that they can be inserted to efficiently. When
> all your data is loaded sequentially, in-order, in-memory and
> immutable; the tree is pointless, more costly to build, and harder to
> maintain - in terms of code complexity.
> 
> In this use case, the only benefit of B-trees would be that they're
> optimized for disk access.

Those are not the reasons for which I'd prefer a B-tree. A B-tree has 
good cache locality, and when you don't need to worry about random 
insertions, page splits, deletions etc., it's also very simple to 
implement. This patch is not much longer than the segmented multi-array.

> On the other side, using B-trees incurs memory overhead due to the
> need for internal nodes, can fragment memory because internal nodes
> aren't the same size as leaf nodes, is easier to get wrong and
> introduce bugs... I don't see a gain.

The memory overhead incurred by the internal nodes is quite minimal, and 
can be adjusted by changing the node sizes. After some experimentation, 
I settled on 2048 items per leaf node, and 64 items per internal node. 
With those values, the overhead caused by the internal nodes is minimal, 
below 0.5%. That seems fine, but we could increase the node sizes to 
bring it further down, if we'd prefer that tradeoff.

I don't understand what memory fragmentation problems you're worried 
about. The tree grows one node at a time, as new TIDs are added, until 
it's all released at the end. I don't see how the size of internal vs 
leaf nodes matters.

> If you propose its use, at least benchmark it to show some gain.

Sure. I used the attached script to test this. It's inspired by the test 
script you posted. It creates a pgbench database with scale factor 100, 
deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped() 
more heavily, the test script creates a number of extra indexes. Half of 
them are on the primary key, just to get more repetitions without having 
to re-initialize in between, and the rest are like this:

create index random_1 on pgbench_accounts((hashint4(aid)))

to stress lazy_vacuum_tid_reaped() with a random access pattern, rather 
than the sequential one that you get with the primary key index.

I ran the test script on my laptop, with unpatched master, with your 
latest multi-array patch, and with the attached version of the b-tree 
patch. The results are quite noisy, unfortunately, so I wouldn't draw 
very strong conclusions from it, but it seems that the performance of 
all three versions is roughly the same. I looked in particular at the 
CPU time spent in the index vacuums, as reported by VACUUM VERBOSE.

> Furthermore, among the 200-ish messages this thread has accumulated,
> better ideas have been proposed, better because they do use less
> memory and are faster (like using bitmaps when possible), but if we
> can't push a simple refactoring first, there's no chance a bigger
> rewrite will fare better. Remember, in this use case, using less
> memory far outweighs any other consideration. Less memory directly
> translates to fewer iterations over the indexes, because more can be
> crammed into m_w_m, which is a huge time saving. Far more than any
> micro-optimization.
> 
> About 2 years ago, I chose to try to push this simple algorithm first,
> then try to improve on it with better data structures. Nobody
> complained at the time (I think, IIRC), and I don't think it's fair to
> go and revisit that now. It just delays getting a solution for this
> issue in pursuit of "the perfect implementation" that might never
> arrive. Or even if it does, there's nothing stopping us from pushing
> another patch in the future with that better implementation if we
> wish. Let's get something simple and proven first.

True all that. My point is that the multi-segmented array isn't all that 
simple and proven, compared to an also straightforward B-tree. It's 
pretty similar to a B-tree, actually, except that it has exactly two 
levels, and the node (= segment) sizes grow exponentially. I'd rather go 
with a true B-tree, than something homegrown that resembles a B-tree, 
but not quite.

> I'm attaching again one version of them (I've been modifying it to
> suit my purposes at each review round), you'll probably want to tweak
> it to build test cases good for your purpose here.

Thanks!

Attached is a new version of my b-tree version. Compared to yesterday's 
version, I fixed a bunch of bugs that turned up in testing.

Looking at the changes to the regression test in this, I don't quite 
understand what it's all about. What are the "wait_barriers" for? If I 
understand correctly, they're added so that the VACUUMs can remove the 
tuples that are deleted in the test. But why are they needed now? Was 
that an orthogonal change we should've done anyway?

Rather than add those wait_barriers, should we stop running the 'vacuum' 
test in parallel with the other tests? Or maybe it's a good thing to run 
it in parallel, to test some other things?

What are the new tests supposed to cover? The test comment says "large 
mwm vacuum runs", and it sets maintenance_work_mem to 1 MB, which isn't 
very large.

- Heikki

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Heikki Linnakangas
Date:
On 06/04/18 16:39, Heikki Linnakangas wrote:
> Sure. I used the attached script to test this. 

Sorry, I attached the wrong script. Here is the correct one that I used. 
Here are also the results I got from running it

- Heikki

Attachment

Re: Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> On 06/04/18 01:59, Claudio Freire wrote:
> > The iteration interface, however, seems quite specific for the use
> > case of vacuumlazy, so it's not really a good abstraction.
> 
> Can you elaborate? It does return the items one block at a time. Is that
> what you mean by being specific for vacuumlazy? I guess that's a bit
> special, but if you imagine some other users for this abstraction, it's
> probably not that unusual. For example, if we started using it in bitmap
> heap scans, a bitmap heap scan would also want to get the TIDs one block
> number at a time.

FWIW I liked the idea of having this abstraction possibly do other
things -- for instance to vacuum brin indexes you'd like to mark index
tuples as "containing tuples that were removed" and eventually
re-summarize the range.  With the current interface we cannot do that,
because vacuum expects brin vacuuming to ask for each heap tuple "is
this tid dead?" and of course we don't have a list of tids to ask for.
So if we can ask instead "how many dead tuples does this block contain?"
brin vacuuming will be much happier.

> Looking at the changes to the regression test in this, I don't quite
> understand what it's all about. What are the "wait_barriers" for? If I
> understand correctly, they're added so that the VACUUMs can remove the
> tuples that are deleted in the test. But why are they needed now? Was that
> an orthogonal change we should've done anyway?
> 
> Rather than add those wait_barriers, should we stop running the 'vacuum'
> test in parallel with the other tests? Or maybe it's a good thing to run it
> in parallel, to test some other things?

20180207235226.zygu4r3yv3yfcnmc@alvherre.pgsql

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 06/04/18 01:59, Claudio Freire wrote:
>>
>> The iteration interface, however, seems quite specific for the use
>> case of vacuumlazy, so it's not really a good abstraction.
>
>
> Can you elaborate? It does return the items one block at a time. Is that
> what you mean by being specific for vacuumlazy? I guess that's a bit
> special, but if you imagine some other users for this abstraction, it's
> probably not that unusual. For example, if we started using it in bitmap
> heap scans, a bitmap heap scan would also want to get the TIDs one block
> number at a time.

But you're also tying the caller to the format of the buffer holding
those TIDs, for instance. Why would you, when you can have an
interface that just iterates TIDs and lets the caller store them
if/however they want?

I do believe a pure iterator interface is a better interface.

>> It also copies stuff a lot, so it's quite heavyweight. I'd suggest
>> trying to go for a lighter weight interface with less overhead that
>> is more general at the same time.
>
>
> Note that there was similar copying, to construct an array of OffsetNumbers,
> happening in lazy_vacuum_page() before this patch. So the net amount of
> copying is the same.
>
> I'm envisioning that this data structure will sooner or later be optimized
> further, so that when you have a lot of TIDs pointing to the same block, we
> would pack them more tightly, storing the block number just once, with an
> array of offset numbers. This interface that returns an array of offset
> numbers matches that future well, as the iterator could just return a
> pointer to the array of offset numbers, with no copying. (If we end up doing
> something even more dense, like a bitmap, then it doesn't help, but that's
> ok too.)

But that's the thing. It's a specialized interface for a future we're
not certain of. It's premature.

A generic interface does not preclude the possibility of implementing
those in the future; it allows you *not* to if there's no gain. Doing
it now forces you to.

>> About the B-tree, however, I don't think a B-tree is a good idea.
>>
>> Trees' main benefit is that they can be inserted to efficiently. When
>> all your data is loaded sequentially, in-order, in-memory and
>> immutable; the tree is pointless, more costly to build, and harder to
>> maintain - in terms of code complexity.
>>
>> In this use case, the only benefit of B-trees would be that they're
>> optimized for disk access.
>
>
> Those are not the reasons for which I'd prefer a B-tree. A B-tree has good
> cache locality, and when you don't need to worry about random insertions,
> page splits, deletions etc., it's also very simple to implement. This patch
> is not much longer than the segmented multi-array.

But it *is* complex and less tested. Testing it and making it mature
will take time. Why do that if doing bitmaps is a better path?

>> On the other side, using B-trees incurs memory overhead due to the
>> need for internal nodes, can fragment memory because internal nodes
>> aren't the same size as leaf nodes, is easier to get wrong and
>> introduce bugs... I don't see a gain.
>
>
> The memory overhead incurred by the internal nodes is quite minimal, and can
> be adjusted by changing the node sizes. After some experimentation, I
> settled on 2048 items per leaf node, and 64 items per internal node. With
> those values, the overhead caused by the internal nodes is minimal, below
> 0.5%. That seems fine, but we could increase the node sizes to bring it
> further down, if we'd prefer that tradeoff.
>
> I don't understand what memory fragmentation problems you're worried about.
> The tree grows one node at a time, as new TIDs are added, until it's all
> released at the end. I don't see how the size of internal vs leaf nodes
> matters.

Large vacuums do several passes, so they'll create one tid map for
each pass. Each pass will allocate m_w_m-worth of pages, and then
deallocate them.

B-tree page allocations are smaller than malloc's mmap threshold, so
freeing them won't return memory to the operating system. Furthermore,
if other allocations get interleaved, objects could be left lying at
random points in the heap, preventing efficient reuse of the heap for
the next round. In essence, internal fragmentation. This may be
exacerbated by AllocSet's own fragmentation and double-accounting.
From what I can tell, inner nodes will use pools, and leaf nodes will
be allocated as dedicated chunks (mallocd).

Segments in my implementation are big, exactly because of that. The
aim is to have large buffers that malloc will mmap, so they get
returned to the os (unmapped) when freed quickly, and with little
overhead.

This fragmentation may cause actual pain in autovacuum, since
autovacuum workers are relatively long-lived.

>> If you propose its use, at least benchmark it to show some gain.
>
>
> Sure. I used the attached script to test this. It's inspired by the test
> script you posted. It creates a pgbench database with scale factor 100,
> deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped() more
> heavily, the test script creates a number of extra indexes. Half of them are
> on the primary key, just to get more repetitions without having to
> re-initialize in between, and the rest are like this:
>
> create index random_1 on pgbench_accounts((hashint4(aid)))
>
> to stress lazy_vacuum_tid_reaped() with a random access pattern, rather than
> the sequential one that you get with the primary key index.
>
> I ran the test script on my laptop, with unpatched master, with your latest
> multi-array patch, and with the attached version of the b-tree patch. The
> results are quite noisy, unfortunately, so I wouldn't draw very strong
> conclusions from it, but it seems that the performance of all three versions
> is roughly the same. I looked in particular at the CPU time spent in the
> index vacuums, as reported by VACUUM VERBOSE.

Scale factor 100 is hardly enough to stress large m_w_m vacuum. I
found scales of 1k-4k (or bigger) are best, with large m_w_m settings
(4G for example), to really see how the data structure performs.

>> Furthermore, among the 200-ish messages this thread has accumulated,
>> better ideas have been proposed, better because they do use less
>> memory and are faster (like using bitmaps when possible), but if we
>> can't push a simple refactoring first, there's no chance a bigger
>> rewrite will fare better. Remember, in this use case, using less
>> memory far outweighs any other consideration. Less memory directly
>> translates to fewer iterations over the indexes, because more can be
>> crammed into m_w_m, which is a huge time saving. Far more than any
>> micro-optimization.
>>
>> About 2 years ago, I chose to try to push this simple algorithm first,
>> then try to improve on it with better data structures. Nobody
>> complained at the time (I think, IIRC), and I don't think it's fair to
>> go and revisit that now. It just delays getting a solution for this
>> issue in pursuit of "the perfect implementation" that might never
>> arrive. Or even if it does, there's nothing stopping us from pushing
>> another patch in the future with that better implementation if we
>> wish. Let's get something simple and proven first.
>
>
> True all that. My point is that the multi-segmented array isn't all that
> simple and proven, compared to an also straightforward B-tree. It's pretty
> similar to a B-tree, actually, except that it has exactly two levels, and
> the node (= segment) sizes grow exponentially. I'd rather go with a true
> B-tree, than something homegrown that resembles a B-tree, but not quite.

I disagree.

Because it is similar to what vacuum is already doing, we can be
confident the approach is sound, at least as sound as current vacuum.
It shares a lot with the current implementation, which is known to be
good.

The multi-segmented array itself has received a lot of testing during
the ~1.5y it has spent in the making, as well. I've been running
extensive benchmarking and tests each time I changed something, and
I've even run deep tests on an actual production snapshot: a complete
db-wide vacuum of a heavily bloated ~12TB production database, showing
significant speedups not only because it does fewer index scans, but
also because it uses less CPU time to do so, with consistency checking
after the fact to check for bugs. Of course I can't guarantee it's
bug-free, but it *is* decently tested.

I'm pretty sure the B-tree implementation hasn't reached that level of
testing yet. It might in the future, but it won't happen overnight.

Your B-tree patch is also homegrown. You're not reusing well tested
btree code, you're coding a B-tree from scratch, so it's as suspect as
any new code. I agree the multi-segment algorithm is quite similar to
a shallow b-tree, but I'm not convinced a b-tree is what we must
aspire to have. In fact, if you used large pages for the B-tree, you
don't need more than 2 levels (there's a 12GB limit ATM on the size of
the tid map), so the multi-segment approach and the b-tree approach
are essentially the same. Except the multi-segment code got more
testing.

In short, there are other more enticing alternatives to try out first.
I'm not enthused by the idea of having to bench and test yet another
sorted set implementation before moving forward.

On Fri, Apr 6, 2018 at 11:00 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Heikki Linnakangas wrote:
>> On 06/04/18 01:59, Claudio Freire wrote:
>> > The iteration interface, however, seems quite specific for the use
>> > case of vacuumlazy, so it's not really a good abstraction.
>>
>> Can you elaborate? It does return the items one block at a time. Is that
>> what you mean by being specific for vacuumlazy? I guess that's a bit
>> special, but if you imagine some other users for this abstraction, it's
>> probably not that unusual. For example, if we started using it in bitmap
>> heap scans, a bitmap heap scan would also want to get the TIDs one block
>> number at a time.
>
> FWIW I liked the idea of having this abstraction possibly do other
> things -- for instance to vacuum brin indexes you'd like to mark index
> tuples as "containing tuples that were removed" and eventually
> re-summarize the range.  With the current interface we cannot do that,
> because vacuum expects brin vacuuming to ask for each heap tuple "is
> this tid dead?" and of course we don't have a list of tids to ask for.
> So if we can ask instead "how many dead tuples does this block contain?"
> brin vacuuming will be much happier.

I don't think either patch gives you that.

The bulkdelete interface is part of the indexam and unlikely to change
in this patch.


Re: Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
Claudio Freire wrote:

> On Fri, Apr 6, 2018 at 11:00 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > FWIW I liked the idea of having this abstraction possibly do other
> > things -- for instance to vacuum brin indexes you'd like to mark index
> > tuples as "containing tuples that were removed" and eventually
> > re-summarize the range.  With the current interface we cannot do that,
> > because vacuum expects brin vacuuming to ask for each heap tuple "is
> > this tid dead?" and of course we don't have a list of tids to ask for.
> > So if we can ask instead "how many dead tuples does this block contain?"
> > brin vacuuming will be much happier.
> 
> I don't think either patch gives you that.
> 
> The bulkdelete interface is part of the indexam and unlikely to change
> in this patch.

I'm sure you're correct.  I was just saying that with the abstract
interface it is easier to implement what I suggest as a follow-on patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 06/04/18 01:59, Claudio Freire wrote:
>>>
>>> The iteration interface, however, seems quite specific for the use
>>> case of vacuumlazy, so it's not really a good abstraction.
>>
>>
>> Can you elaborate? It does return the items one block at a time. Is that
>> what you mean by being specific for vacuumlazy? I guess that's a bit
>> special, but if you imagine some other users for this abstraction, it's
>> probably not that unusual. For example, if we started using it in bitmap
>> heap scans, a bitmap heap scan would also want to get the TIDs one block
>> number at a time.
>
> But you're also tying the caller to the format of the buffer holding
> those TIDs, for instance. Why would you, when you can have an
> interface that just iterates TIDs and let the caller store them
> if/however they want?
>
> I do believe a pure iterator interface is a better interface.

Between the b-tree or not discussion and the refactoring to separate
the code, I don't think we'll get this in the next 24 hours.

So I guess we'll have ample time to ponder on both issues during the
next commit fest.


Re: Vacuum: allow usage of more than 1GB of work mem

From
Andrew Dunstan
Date:

On 04/06/2018 08:00 PM, Claudio Freire wrote:
> On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> On 06/04/18 01:59, Claudio Freire wrote:
>>>> The iteration interface, however, seems quite specific for the use
>>>> case of vacuumlazy, so it's not really a good abstraction.
>>>
>>> Can you elaborate? It does return the items one block at a time. Is that
>>> what you mean by being specific for vacuumlazy? I guess that's a bit
>>> special, but if you imagine some other users for this abstraction, it's
>>> probably not that unusual. For example, if we started using it in bitmap
>>> heap scans, a bitmap heap scan would also want to get the TIDs one block
>>> number at a time.
>> But you're also tying the caller to the format of the buffer holding
>> those TIDs, for instance. Why would you, when you can have an
>> interface that just iterates TIDs and let the caller store them
>> if/however they want?
>>
>> I do believe a pure iterator interface is a better interface.
> Between the b-tree or not discussion and the refactoring to separate
> the code, I don't think we'll get this in the next 24 hours.
>
> So I guess we'll have ample time to ponder on both issues during the
> next commit fest.
>



There doesn't seem to have been much pondering done since then, at least 
publicly. Can we make some progress on this? It's been around for a long 
time now.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
>
>
>
> On 04/06/2018 08:00 PM, Claudio Freire wrote:
> > On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> >> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>> On 06/04/18 01:59, Claudio Freire wrote:
> >>>> The iteration interface, however, seems quite specific for the use
> >>>> case of vacuumlazy, so it's not really a good abstraction.
> >>>
> >>> Can you elaborate? It does return the items one block at a time. Is that
> >>> what you mean by being specific for vacuumlazy? I guess that's a bit
> >>> special, but if you imagine some other users for this abstraction, it's
> >>> probably not that unusual. For example, if we started using it in bitmap
> >>> heap scans, a bitmap heap scan would also want to get the TIDs one block
> >>> number at a time.
> >> But you're also tying the caller to the format of the buffer holding
> >> those TIDs, for instance. Why would you, when you can have an
> >> interface that just iterates TIDs and let the caller store them
> >> if/however they want?
> >>
> >> I do believe a pure iterator interface is a better interface.
> > Between the b-tree or not discussion and the refactoring to separate
> > the code, I don't think we'll get this in the next 24 hours.
> >
> > So I guess we'll have ample time to ponder on both issues during the
> > next commit fest.
> >
>
>
>
> There doesn't seem to have been much pondering done since then, at least
> publicly. Can we make some progress on this? It's been around for a long
> time now.

Yeah, life has kept me busy and I haven't had much time to make
progress here, but I was planning on doing the refactoring as we were
discussing soon. Can't give a time frame for that, but "soonish".


Re: Vacuum: allow usage of more than 1GB of work mem

From
Andrew Dunstan
Date:

On 07/12/2018 12:38 PM, Claudio Freire wrote:
> On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>>
>>
>> On 04/06/2018 08:00 PM, Claudio Freire wrote:
>>> On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>>> On 06/04/18 01:59, Claudio Freire wrote:
>>>>>> The iteration interface, however, seems quite specific for the use
>>>>>> case of vacuumlazy, so it's not really a good abstraction.
>>>>> Can you elaborate? It does return the items one block at a time. Is that
>>>>> what you mean by being specific for vacuumlazy? I guess that's a bit
>>>>> special, but if you imagine some other users for this abstraction, it's
>>>>> probably not that unusual. For example, if we started using it in bitmap
>>>>> heap scans, a bitmap heap scan would also want to get the TIDs one block
>>>>> number at a time.
>>>> But you're also tying the caller to the format of the buffer holding
>>>> those TIDs, for instance. Why would you, when you can have an
>>>> interface that just iterates TIDs and let the caller store them
>>>> if/however they want?
>>>>
>>>> I do believe a pure iterator interface is a better interface.
>>> Between the b-tree or not discussion and the refactoring to separate
>>> the code, I don't think we'll get this in the next 24 hours.
>>>
>>> So I guess we'll have ample time to ponder on both issues during the
>>> next commit fest.
>>>
>>
>>
>> There doesn't seem to have been much pondering done since then, at least
>> publicly. Can we make some progress on this? It's been around for a long
>> time now.
> Yeah, life has kept me busy and I haven't had much time to make
> progress here, but I was planning on doing the refactoring as we were
> discussing soon. Can't give a time frame for that, but "soonish".

I fully understand. I think this needs to go back to "Waiting on Author".

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Alvaro Herrera
Date:
On 2018-Jul-12, Andrew Dunstan wrote:

> I fully understand. I think this needs to go back to "Waiting on Author".

Why?  Heikki's patch applies fine and passes the regression tests.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Vacuum: allow usage of more than 1GB of work mem

From
Andrew Dunstan
Date:

On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> On 2018-Jul-12, Andrew Dunstan wrote:
>
>> I fully understand. I think this needs to go back to "Waiting on Author".
> Why?  Heikki's patch applies fine and passes the regression tests.
>


Well, I understood Claudio was going to do some more work (see 
upthread). If we're going to go with Heikki's patch then do we need to 
change the author, or add him as an author?

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Heikki Linnakangas
Date:
On 13/07/18 01:39, Andrew Dunstan wrote:
> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
>> On 2018-Jul-12, Andrew Dunstan wrote:
>>
>>> I fully understand. I think this needs to go back to "Waiting on Author".
>> Why?  Heikki's patch applies fine and passes the regression tests.
> 
> Well, I understood Claudio was going to do some more work (see
> upthread).

Claudio raised a good point, that doing small pallocs leads to 
fragmentation, and in particular, it might mean that we can't give back 
the memory to the OS. The default glibc malloc() implementation has a 
threshold of 4 or 32 MB or something like that - allocations larger than 
the threshold are mmap()'d, and can always be returned to the OS. I 
think a simple solution to that is to allocate larger chunks, something 
like 32-64 MB at a time, and carve out the allocations for the nodes 
from those chunks. That's pretty straightforward, because we don't need 
to worry about freeing the nodes in retail. Keep track of the current 
half-filled chunk, and allocate a new one when it fills up.
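
A minimal sketch of that chunking scheme, just to illustrate the idea
(names and sizes are illustrative, not from any posted patch; it
assumes node allocations are much smaller than a chunk):

#include <stdlib.h>
#include <stddef.h>

#define CHUNK_BYTES (64UL * 1024 * 1024)    /* large enough that glibc will mmap() it */

typedef struct MemChunk
{
    struct MemChunk *next;      /* chunks are chained so they can all be freed at the end */
    size_t      used;           /* bytes handed out from data[] so far */
    char        data[];
} MemChunk;

typedef struct ChunkPool
{
    MemChunk   *current;        /* half-filled chunk we are carving nodes from */
} ChunkPool;

/* carve a node out of the current chunk, starting a new chunk when it fills up */
static void *
pool_alloc(ChunkPool *pool, size_t size)
{
    void       *p;

    size = (size + 7) & ~(size_t) 7;    /* keep nodes 8-byte aligned */

    if (pool->current == NULL ||
        pool->current->used + size > CHUNK_BYTES - sizeof(MemChunk))
    {
        MemChunk   *chunk = malloc(CHUNK_BYTES);

        if (chunk == NULL)
            return NULL;
        chunk->next = pool->current;
        chunk->used = 0;
        pool->current = chunk;
    }

    p = pool->current->data + pool->current->used;
    pool->current->used += size;
    return p;
}

/* nodes are never freed individually; the whole pool is released at once */
static void
pool_free_all(ChunkPool *pool)
{
    while (pool->current != NULL)
    {
        MemChunk   *next = pool->current->next;

        free(pool->current);    /* mmap()ed chunks go straight back to the OS */
        pool->current = next;
    }
}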

He also wanted to refactor the iterator API, to return one ItemPointer 
at a time. I don't think that's necessary, the current iterator API is 
more convenient for the callers, but I don't feel strongly about that.

Anything else?

> If we're going to go with Heikki's patch then do we need to
> change the author, or add him as an author?

Let's list both of us. At least in the commit message, doesn't matter 
much what the commitfest app says.

- Heikki


Re: Vacuum: allow usage of more than 1GB of work mem

From
Andrew Dunstan
Date:

On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> On 13/07/18 01:39, Andrew Dunstan wrote:
>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
>>> On 2018-Jul-12, Andrew Dunstan wrote:
>>>
>>>> I fully understand. I think this needs to go back to "Waiting on 
>>>> Author".
>>> Why?  Heikki's patch applies fine and passes the regression tests.
>>
>> Well, I understood Claudio was going to do some more work (see
>> upthread).
>
> Claudio raised a good point, that doing small pallocs leads to 
> fragmentation, and in particular, it might mean that we can't give 
> back the memory to the OS. The default glibc malloc() implementation 
> has a threshold of 4 or 32 MB or something like that - allocations 
> larger than the threshold are mmap()'d, and can always be returned to 
> the OS. I think a simple solution to that is to allocate larger 
> chunks, something like 32-64 MB at a time, and carve out the 
> allocations for the nodes from those chunks. That's pretty 
> straightforward, because we don't need to worry about freeing the 
> nodes in retail. Keep track of the current half-filled chunk, and 
> allocate a new one when it fills up.


Google seems to suggest the default threshold is much lower, like 128K. 
Still, making larger allocations seems sensible. Are you going to work 
on that?


>
> He also wanted to refactor the iterator API, to return one ItemPointer 
> at a time. I don't think that's necessary, the current iterator API is 
> more convenient for the callers, but I don't feel strongly about that.
>
> Anything else?
>
>> If we're going to go with Heikki's patch then do we need to
>> change the author, or add him as an author?
>
> Let's list both of us. At least in the commit message, doesn't matter 
> much what the commitfest app says.
>


I added you as an author in the CF App

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
>
>
>
> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> > On 13/07/18 01:39, Andrew Dunstan wrote:
> >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> >>> On 2018-Jul-12, Andrew Dunstan wrote:
> >>>
> >>>> I fully understand. I think this needs to go back to "Waiting on
> >>>> Author".
> >>> Why?  Heikki's patch applies fine and passes the regression tests.
> >>
> >> Well, I understood Claudio was going to do some more work (see
> >> upthread).
> >
> > Claudio raised a good point, that doing small pallocs leads to
> > fragmentation, and in particular, it might mean that we can't give
> > back the memory to the OS. The default glibc malloc() implementation
> > has a threshold of 4 or 32 MB or something like that - allocations
> > larger than the threshold are mmap()'d, and can always be returned to
> > the OS. I think a simple solution to that is to allocate larger
> > chunks, something like 32-64 MB at a time, and carve out the
> > allocations for the nodes from those chunks. That's pretty
> > straightforward, because we don't need to worry about freeing the
> > nodes in retail. Keep track of the current half-filled chunk, and
> > allocate a new one when it fills up.
>
>
> Google seems to suggest the default threshold is much lower, like 128K.
> Still, making larger allocations seems sensible. Are you going to work
> on that?

Below a few MB the threshold is dynamic, and if a block bigger than
128K but smaller than the higher threshold (32-64MB IIRC) is freed,
the dynamic threshold is set to the size of the freed block.

See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]

So I'd suggest allocating blocks bigger than M_MMAP_MAX.

[1] http://man7.org/linux/man-pages/man3/mallopt.3.html
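
For what it's worth, a tiny glibc-only illustration of the two ways around
the dynamic threshold (a standalone sketch, not part of any patch; the
constants come from the mallopt man page):

#include <stdlib.h>
#include <malloc.h>

int
main(void)
{
    /*
     * Option 1: pin the threshold explicitly.  Setting M_MMAP_THRESHOLD
     * disables glibc's dynamic adjustment, so anything above 1 MB will
     * always be serviced with mmap() and returned to the OS on free().
     */
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

    /*
     * Option 2: simply allocate bigger than the maximum dynamic threshold
     * (DEFAULT_MMAP_THRESHOLD_MAX, 32 MB on 64-bit Linux).  A 64 MB block
     * is mmap()'d regardless of how far the dynamic threshold has grown.
     */
    void *chunk = malloc(64UL * 1024 * 1024);

    free(chunk);        /* the whole block goes straight back to the OS */
    return 0;
}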


Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire <klaussfreire@gmail.com> wrote:
>
> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
> >
> >
> >
> > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> > > On 13/07/18 01:39, Andrew Dunstan wrote:
> > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> > >>> On 2018-Jul-12, Andrew Dunstan wrote:
> > >>>
> > >>>> I fully understand. I think this needs to go back to "Waiting on
> > >>>> Author".
> > >>> Why?  Heikki's patch applies fine and passes the regression tests.
> > >>
> > >> Well, I understood Claudio was going to do some more work (see
> > >> upthread).
> > >
> > > Claudio raised a good point, that doing small pallocs leads to
> > > fragmentation, and in particular, it might mean that we can't give
> > > back the memory to the OS. The default glibc malloc() implementation
> > > has a threshold of 4 or 32 MB or something like that - allocations
> > > larger than the threshold are mmap()'d, and can always be returned to
> > > the OS. I think a simple solution to that is to allocate larger
> > > chunks, something like 32-64 MB at a time, and carve out the
> > > allocations for the nodes from those chunks. That's pretty
> > > straightforward, because we don't need to worry about freeing the
> > > nodes in retail. Keep track of the current half-filled chunk, and
> > > allocate a new one when it fills up.
> >
> >
> > Google seems to suggest the default threshold is much lower, like 128K.
> > Still, making larger allocations seems sensible. Are you going to work
> > on that?
>
> Below a few MB the threshold is dynamic, and if a block bigger than
> 128K but smaller than the higher threshold (32-64MB IIRC) is freed,
> the dynamic threshold is set to the size of the freed block.
>
> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
>
> So I'd suggest allocating blocks bigger than M_MMAP_MAX.
>
> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html

Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the
former is something else.


Re: Vacuum: allow usage of more than 1GB of work mem

From
Heikki Linnakangas
Date:
On 16/07/18 18:35, Claudio Freire wrote:
> On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
>> <andrew.dunstan@2ndquadrant.com> wrote:
>>> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
>>>> Claudio raised a good point, that doing small pallocs leads to
>>>> fragmentation, and in particular, it might mean that we can't give
>>>> back the memory to the OS. The default glibc malloc() implementation
>>>> has a threshold of 4 or 32 MB or something like that - allocations
>>>> larger than the threshold are mmap()'d, and can always be returned to
>>>> the OS. I think a simple solution to that is to allocate larger
>>>> chunks, something like 32-64 MB at a time, and carve out the
>>>> allocations for the nodes from those chunks. That's pretty
>>>> straightforward, because we don't need to worry about freeing the
>>>> nodes in retail. Keep track of the current half-filled chunk, and
>>>> allocate a new one when it fills up.
>>>
>>> Google seems to suggest the default threshold is much lower, like 128K.
>>> Still, making larger allocations seems sensible. Are you going to work
>>> on that?
>>
>> Below a few MB the threshold is dynamic, and if a block bigger than
>> 128K but smaller than the higher threshold (32-64MB IIRC) is freed,
>> the dynamic threshold is set to the size of the freed block.
>>
>> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
>>
>> So I'd suggest allocating blocks bigger than M_MMAP_MAX.
>>
>> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
> 
> Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the
> former is something else.

Yeah, we basically want to be well above whatever the threshold is. I 
don't think we should try to check for any specific constant, just make 
it large enough. Different libc implementations might have different 
policies, too. There's little harm in overshooting, and making e.g. 64 
MB allocations when 1 MB would've been enough to trigger the mmap() 
behavior. It's going to be more granular than the current situation, 
anyway, where we do a single massive allocation.

(A code comment to briefly mention the thresholds on common platforms 
would be good, though).
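
Something along these lines, perhaps (the numbers are just the glibc
defaults from the mallopt man page; other platforms would need their own
notes):

/*
 * Allocate dead-tuple segments in large chunks so that the underlying
 * malloc() can return the memory to the OS when it is freed.  On glibc,
 * requests above the mmap threshold are serviced with mmap(); the
 * threshold starts at 128 kB but can grow dynamically up to 32 MB on
 * 64-bit platforms (512 kB on 32-bit), so chunks of tens of megabytes
 * are safely above it.  Other allocators have their own policies, but
 * chunks this large are expected to be above their thresholds as well.
 */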

- Heikki


Re: Vacuum: allow usage of more than 1GB of work mem

From
Andrew Dunstan
Date:

On 07/16/2018 10:34 AM, Claudio Freire wrote:
> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>>
>>
>> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
>>> On 13/07/18 01:39, Andrew Dunstan wrote:
>>>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
>>>>> On 2018-Jul-12, Andrew Dunstan wrote:
>>>>>
>>>>>> I fully understand. I think this needs to go back to "Waiting on
>>>>>> Author".
>>>>> Why?  Heikki's patch applies fine and passes the regression tests.
>>>> Well, I understood Claudio was going to do some more work (see
>>>> upthread).
>>> Claudio raised a good point, that doing small pallocs leads to
>>> fragmentation, and in particular, it might mean that we can't give
>>> back the memory to the OS. The default glibc malloc() implementation
>>> has a threshold of 4 or 32 MB or something like that - allocations
>>> larger than the threshold are mmap()'d, and can always be returned to
>>> the OS. I think a simple solution to that is to allocate larger
>>> chunks, something like 32-64 MB at a time, and carve out the
>>> allocations for the nodes from those chunks. That's pretty
>>> straightforward, because we don't need to worry about freeing the
>>> nodes in retail. Keep track of the current half-filled chunk, and
>>> allocate a new one when it fills up.
>>
>> Google seems to suggest the default threshold is much lower, like 128K.
>> Still, making larger allocations seems sensible. Are you going to work
>> on that?
> Below a few MB the threshold is dynamic, and if a block bigger than
> 128K but smaller than the higher threshold (32-64MB IIRC) is freed,
> the dynamic threshold is set to the size of the freed block.
>
> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
>
> So I'd suggest allocating blocks bigger than M_MMAP_MAX.
>
> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html


That page says:

        M_MMAP_MAX
               This parameter specifies the maximum number of allocation
               requests that may be simultaneously serviced using mmap(2).
               This parameter exists because some systems have a limited
               number of internal tables for use by mmap(2), and using more
               than a few of them may degrade performance.

               The default value is 65,536, a value which has no special
               significance and which serves only as a safeguard. Setting
               this parameter to 0 disables the use of mmap(2) for servicing
               large allocation requests.


I'm confused about the relevance.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Claudio Freire
Date:
On Mon, Jul 16, 2018 at 3:30 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
>
>
>
> On 07/16/2018 10:34 AM, Claudio Freire wrote:
> > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
> > <andrew.dunstan@2ndquadrant.com> wrote:
> >>
> >>
> >> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> >>> On 13/07/18 01:39, Andrew Dunstan wrote:
> >>>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> >>>>> On 2018-Jul-12, Andrew Dunstan wrote:
> >>>>>
> >>>>>> I fully understand. I think this needs to go back to "Waiting on
> >>>>>> Author".
> >>>>> Why?  Heikki's patch applies fine and passes the regression tests.
> >>>> Well, I understood Claudio was going to do some more work (see
> >>>> upthread).
> >>> Claudio raised a good point, that doing small pallocs leads to
> >>> fragmentation, and in particular, it might mean that we can't give
> >>> back the memory to the OS. The default glibc malloc() implementation
> >>> has a threshold of 4 or 32 MB or something like that - allocations
> >>> larger than the threshold are mmap()'d, and can always be returned to
> >>> the OS. I think a simple solution to that is to allocate larger
> >>> chunks, something like 32-64 MB at a time, and carve out the
> >>> allocations for the nodes from those chunks. That's pretty
> >>> straightforward, because we don't need to worry about freeing the
> >>> nodes in retail. Keep track of the current half-filled chunk, and
> >>> allocate a new one when it fills up.
> >>
> >> Google seems to suggest the default threshold is much lower, like 128K.
> >> Still, making larger allocations seems sensible. Are you going to work
> >> on that?
> > Below a few MB the threshold is dynamic, and if a block bigger than
> > 128K but smaller than the higher threshold (32-64MB IIRC) is freed,
> > the dynamic threshold is set to the size of the freed block.
> >
> > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
> >
> > So I'd suggest allocating blocks bigger than M_MMAP_MAX.
> >
> > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
>
>
> That page says:
>
>         M_MMAP_MAX
>                This parameter specifies the maximum number of allocation
>                requests that may be simultaneously serviced using mmap(2).
>                This parameter exists because some systems have a limited
>                number of internal tables for use by mmap(2), and using more
>                than a few of them may degrade performance.
>
>                The default value is 65,536, a value which has no special
>                significance and which serves only as a safeguard. Setting
>                this parameter to 0 disables the use of mmap(2) for servicing
>                large allocation requests.
>
>
> I'm confused about the relevance.

It isn't relevant. See my next message; it should have read
DEFAULT_MMAP_THRESHOLD_MAX.


Re: Vacuum: allow usage of more than 1GB of work mem

From
Andrew Dunstan
Date:

On 07/16/2018 11:35 AM, Claudio Freire wrote:
> On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
>> <andrew.dunstan@2ndquadrant.com> wrote:
>>>
>>>
>>> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
>>>> On 13/07/18 01:39, Andrew Dunstan wrote:
>>>>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
>>>>>> On 2018-Jul-12, Andrew Dunstan wrote:
>>>>>>
>>>>>>> I fully understand. I think this needs to go back to "Waiting on
>>>>>>> Author".
>>>>>> Why?  Heikki's patch applies fine and passes the regression tests.
>>>>> Well, I understood Claudio was going to do some more work (see
>>>>> upthread).
>>>> Claudio raised a good point, that doing small pallocs leads to
>>>> fragmentation, and in particular, it might mean that we can't give
>>>> back the memory to the OS. The default glibc malloc() implementation
>>>> has a threshold of 4 or 32 MB or something like that - allocations
>>>> larger than the threshold are mmap()'d, and can always be returned to
>>>> the OS. I think a simple solution to that is to allocate larger
>>>> chunks, something like 32-64 MB at a time, and carve out the
>>>> allocations for the nodes from those chunks. That's pretty
>>>> straightforward, because we don't need to worry about freeing the
>>>> nodes in retail. Keep track of the current half-filled chunk, and
>>>> allocate a new one when it fills up.
>>>
>>> Google seems to suggest the default threshold is much lower, like 128K.
>>> Still, making larger allocations seems sensible. Are you going to work
>>> on that?
>> Below a few MB the threshold is dynamic, and if a block bigger than
>> 128K but smaller than the higher threshold (32-64MB IIRC) is freed,
>> the dynamic threshold is set to the size of the freed block.
>>
>> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
>>
>> So I'd suggest allocating blocks bigger than M_MMAP_MAX.
>>
>> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
> Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the
> former is something else.


Ah, ok. Thanks. Ignore the email I just sent about that.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

From
Robert Haas
Date:
On Fri, Apr 6, 2018 at 4:25 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> True all that. My point is that the multi-segmented array isn't all that
>> simple and proven, compared to an also straightforward B-tree. It's pretty
>> similar to a B-tree, actually, except that it has exactly two levels, and
>> the node (= segment) sizes grow exponentially. I'd rather go with a true
>> B-tree, than something homegrown that resembles a B-tree, but not quite.
>
> I disagree.

Yeah, me too.  I think a segmented array is a lot simpler than a
home-grown btree.  I wrote a home-grown btree that ended up becoming
src/backend/utils/mmgr/freepage.c and it took me a long time to get
rid of all the bugs.  Heikki is almost certainly better at coding up a
bug-free btree than I am, but a segmented array is a dead simple data
structure, or should be if done properly, and a btree is not.
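
To illustrate, a rough sketch of such a two-level segmented array, with
exponentially growing segments (names and sizes are invented for the
example; this is not code from any posted patch):

#include "postgres.h"
#include "storage/itemptr.h"

#define FIRST_SEG_TUPLES    (128 * 1024)    /* first segment holds 128k TIDs */

typedef struct DeadTupleSegment
{
    int              num_tuples;    /* TIDs stored in this segment */
    int              max_tuples;    /* capacity of this segment */
    ItemPointerData *itemptrs;      /* second level: array of max_tuples TIDs */
} DeadTupleSegment;

typedef struct DeadTupleArray
{
    int               num_segs;     /* segments in use */
    int               max_segs;     /* allocated length of 'segs' */
    DeadTupleSegment *segs;         /* first level: array of segment headers */
} DeadTupleArray;

static void
dead_tuples_add(DeadTupleArray *arr, ItemPointer itemptr)
{
    DeadTupleSegment *seg;

    /* Segment full (or none yet)?  Add one, twice the size of the last. */
    if (arr->num_segs == 0 ||
        arr->segs[arr->num_segs - 1].num_tuples >=
        arr->segs[arr->num_segs - 1].max_tuples)
    {
        int     newsize = (arr->num_segs == 0) ? FIRST_SEG_TUPLES :
            arr->segs[arr->num_segs - 1].max_tuples * 2;

        if (arr->num_segs >= arr->max_segs)
        {
            arr->max_segs = Max(arr->max_segs * 2, 8);
            if (arr->segs == NULL)
                arr->segs = palloc(arr->max_segs * sizeof(DeadTupleSegment));
            else
                arr->segs = repalloc(arr->segs,
                                     arr->max_segs * sizeof(DeadTupleSegment));
        }

        /* A real implementation would cap newsize at the work_mem budget. */
        seg = &arr->segs[arr->num_segs++];
        seg->num_tuples = 0;
        seg->max_tuples = newsize;
        seg->itemptrs = palloc(newsize * sizeof(ItemPointerData));
    }

    seg = &arr->segs[arr->num_segs - 1];
    seg->itemptrs[seg->num_tuples++] = *itemptr;
}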

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Vacuum: allow usage of more than 1GB of work mem

From
Michael Paquier
Date:
On Mon, Jul 16, 2018 at 02:33:17PM -0400, Andrew Dunstan wrote:
> Ah, ok. Thanks. ignore the email I just sent about that.

So...  This thread has basically died of inactivity, while there have
been a couple of interesting things discussed, like the version from
Heikki here:
https://www.postgresql.org/message-id/cd8f7b62-17e1-4307-9f81-427922e5a1f6@iki.fi

I am marking the patches as returned with feedback for now.
--
Michael

Attachment