Thread: Why percent_rank is so slower than rank?

Why percent_rank is so slower than rank?

From
Jie Li
Date:
Hi all,

I'm new to window functions. Recently I ran some simple queries but was
surprised to find percent_rank is so much slower than rank, could anybody
tell me why?

The table schema:

test=# \d inventory1
         Table "public.inventory1"
        Column        |  Type   | Modifiers
----------------------+---------+-----------
 inv_date_sk          | integer | not null
 inv_item_sk          | integer | not null
 inv_warehouse_sk     | integer | not null
 inv_quantity_on_hand | integer |

test=# \dt+ inventory1
                       List of relations
 Schema |    Name    | Type  |  Owner   |  Size   | Description
--------+------------+-------+----------+---------+-------------
 public | inventory1 | table | workshop | 8880 kB |

The rank query result:

test=# explain analyze select inv_date_sk, inv_item_sk, rank() over(partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=631.947..1361.158 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=631.924..771.990 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.055..198.948 rows=189000 loops=1)
 Total runtime: 1500.193 ms
(6 rows)

The percent_rank result:

test=# explain analyze select inv_date_sk, inv_item_sk, percent_rank() over(partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=766.432..32924.804 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=756.320..905.407 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.102..224.607 rows=189000 loops=1)
 Total runtime: 33152.188 ms
(6 rows)

One special thing is that all the values of the partition key (inv_date_sk)
are the same, that is, there is only one window partition. I find that
percent_rank needs to buffer all the tuples to get the total number of rows.
But why is it so expensive?

I am using 8.4.4, and I only increased work_mem to 100MB and left the other
parameters untouched.

Thanks,
Li Jie

Re: Why percent_rank is so slower than rank?

From
Tom Lane
Date:
Jie Li <jay23jack@gmail.com> writes:
> I'm new to window functions. Recently I ran some simple queries but was
> surprised to find percent_rank is so much slower than rank, could anybody
> tell me why?

Huh, interesting.  I can reproduce this with toy data, such as

create table inventory1 (inv_date_sk int, inv_item_sk int);
insert into inventory1 select 1, random()* 100000 from generate_series(1,189000);
explain analyze select inv_date_sk,inv_item_sk, percent_rank()over(partition by inv_date_sk order by inv_item_sk) from
inventory1;

The example is *not* particularly slow if you leave work_mem at default.
But if you bump up work_mem enough so that the WindowAgg's internal
tuplestore fits into memory, it slows down like crazy.  A bit of quality
time with oprofile shows that all the time is going into this memmove()
in tuplestore_trim():
   /*
    * Slide the array down and readjust pointers.  This may look pretty
    * stupid, but we expect that there will usually not be very many
    * tuple-pointers to move, so this isn't that expensive; and it keeps a
    * lot of other logic simple.
    *
    * In fact, in the current usage for merge joins, it's demonstrable that
    * there will always be exactly one non-removed tuple; so optimize that
    * case.
    */
   if (nremove + 1 == state->memtupcount)
       state->memtuples[0] = state->memtuples[nremove];
   else
       memmove(state->memtuples, state->memtuples + nremove,
               (state->memtupcount - nremove) * sizeof(void *));
 

We're throwing away one tuple at a time as we advance forward through
the tuplestore, and moving 100000+ tuple pointers each time.  Ugh.
This code was all right when written, because (IIRC) the mergejoin
case was actually the only caller.  But it's not all right for
WindowAgg's less-predictable usage patterns.
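
To see why that hurts, here is a minimal standalone C sketch (not
PostgreSQL code; the array size mirrors the 189000-row example above and
the names are made up) of what discarding one leading pointer at a time
from a flat array costs:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Simulate a tuplestore-like pointer array whose caller repeatedly
 * discards the oldest entry.  Sliding the whole array down by one slot
 * per discard is O(N) each time, hence O(N^2) overall -- the pattern
 * being hit inside tuplestore_trim() here.
 */
int
main(void)
{
    int         ntuples = 189000;   /* same order as the reported row count */
    void      **slots = malloc(ntuples * sizeof(void *));
    long long   moved = 0;
    int         count = ntuples;

    for (int i = 0; i < ntuples; i++)
        slots[i] = &slots[i];       /* stand-ins for palloc'd tuples */

    while (count > 1)
    {
        /* drop slots[0], then slide the remainder down one position */
        memmove(slots, slots + 1, (count - 1) * sizeof(void *));
        moved += count - 1;
        count--;
    }

    /* ~1.8e10 pointer moves; this takes a noticeable while, which is the point */
    printf("pointer moves: %lld\n", moved);
    free(slots);
    return 0;
}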

I thought for a bit about changing things around so that the first-used
tuple slot isn't necessarily state->memtuples[0], but just like the
comment says, that complicates a lot of other logic.  And there isn't
any easy place to reclaim the wasted slots later.

What seems like the best bet is to put in a heuristic to make
tuplestore_trim simply not do anything until nremove reaches some
reasonably large amount, perhaps 10% of the number of stored tuples.
This wastes up to 10% of the allotted memory, but that seems tolerable.
We could complicate things a bit more by remembering that so-and-so
many slots are authorized to be removed, and forcing a trim operation
to discard them if we find ourselves in memory trouble.  I'm not sure
that extra complication is worthwhile though.  Comments?
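
Concretely, the proposal amounts to something like this (a standalone
sketch only, not tuplestore.c itself; the struct, names, and exact
threshold are placeholders):

#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the in-memory pointer array of a tuplestore. */
typedef struct
{
    void  **tuples;     /* pointers to stored tuples */
    int     count;      /* number of stored tuples */
} Store;

/*
 * Proposed heuristic: ignore small trim requests outright, and only
 * free-and-slide once the caller has asked to drop a sizable fraction
 * of the array, so the O(count) memmove happens rarely rather than
 * once per row.
 */
void
store_trim(Store *st, int nremove)
{
    if (nremove < st->count / 10)
        return;                 /* too few to bother with yet */

    for (int i = 0; i < nremove; i++)
        free(st->tuples[i]);    /* the real code pfrees the stored tuples */

    memmove(st->tuples, st->tuples + nremove,
            (st->count - nremove) * sizeof(void *));
    st->count -= nremove;
}
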
        regards, tom lane


Re: Why percent_rank is so slower than rank?

From
Tom Lane
Date:
I wrote:
> We're throwing away one tuple at a time as we advance forward through
> the tuplestore, and moving 100000+ tuple pointers each time.  Ugh.
> This code was all right when written, because (IIRC) the mergejoin
> case was actually the only caller.  But it's not all right for
> WindowAgg's less-predictable usage patterns.

> I thought for a bit about changing things around so that the first-used
> tuple slot isn't necessarily state->memtuples[0], but just like the
> comment says, that complicates a lot of other logic.  And there isn't
> any easy place to reclaim the wasted slots later.

> What seems like the best bet is to put in a heuristic to make
> tuplestore_trim simply not do anything until nremove reaches some
> reasonably large amount, perhaps 10% of the number of stored tuples.
> This wastes up to 10% of the allotted memory, but that seems tolerable.

On reflection I think just not doing anything isn't a very good idea.
The problem with that is that a mis-coded caller could try to fetch
tuples that it had already told the tuplestore could be trimmed away;
and this would work, most of the time, until you got unlucky and the
trim operation had actually deleted them.  I think it's pretty important
for bug-catching purposes that the tuplestore enforce that those tuples
are not available anymore.

Hence the attached patch, which combines the two ideas by recycling
tuples immediately but not sliding the pointer array until a reasonable
amount of movement has occurred.  This fixes the complained-of
performance problem AFAICT.
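
In distilled form, the trim logic after the patch behaves roughly as
below (a standalone sketch with simplified structure and names, not the
actual tuplestore.c code; see the diff for the real thing):

#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the in-memory part of a Tuplestorestate. */
typedef struct
{
    void  **tuples;     /* pointers to stored tuples */
    int     ndeleted;   /* leading slots already freed, not yet slid over */
    int     count;      /* total slots in use, including freed ones */
} Store;

/*
 * Logically remove the first nremove tuples.  They are freed (and their
 * slots set to NULL so stale reads are caught) right away, but the
 * pointer array is only slid down once enough dead slots accumulate,
 * so the memmove cost is amortized instead of being paid per tuple.
 */
void
store_trim(Store *st, int nremove)
{
    for (int i = st->ndeleted; i < nremove; i++)
    {
        free(st->tuples[i]);
        st->tuples[i] = NULL;
    }
    st->ndeleted = nremove;

    /* Defer the slide until at least 1/8 of the array is dead slots. */
    if (st->ndeleted < st->count / 8)
        return;

    memmove(st->tuples, st->tuples + st->ndeleted,
            (st->count - st->ndeleted) * sizeof(void *));
    st->count -= st->ndeleted;
    st->ndeleted = 0;   /* the real patch also shifts its read pointers here */
}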

I'm not sure whether or not to back-patch this into 9.0 and 8.4.  The
code in tuplestore.c hasn't changed at all since 8.4, so there's not
much risk of cross-version bugs, but if I did miss anything we could
be shipping a buggy version next week.  Thoughts?

            regards, tom lane

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 9bbaba43771f495fdf24e9f2afd545b69a22ecbd..8c8139c897679892e0d4ad13e69ae8d814484206 100644
*** a/src/backend/utils/sort/tuplestore.c
--- b/src/backend/utils/sort/tuplestore.c
*************** struct Tuplestorestate
*** 145,152 ****
--- 145,159 ----
      /*
       * This array holds pointers to tuples in memory if we are in state INMEM.
       * In states WRITEFILE and READFILE it's not used.
+      *
+      * When memtupdeleted > 0, the first memtupdeleted pointers are already
+      * released due to a tuplestore_trim() operation, but we haven't expended
+      * the effort to slide the remaining pointers down.  These unused pointers
+      * are set to NULL to catch any invalid accesses.  Note that memtupcount
+      * includes the deleted pointers.
       */
      void      **memtuples;        /* array of pointers to palloc'd tuples */
+     int            memtupdeleted;    /* the first N slots are currently unused */
      int            memtupcount;    /* number of tuples currently present */
      int            memtupsize;        /* allocated length of memtuples array */

*************** tuplestore_begin_common(int eflags, bool
*** 252,257 ****
--- 259,265 ----
      state->context = CurrentMemoryContext;
      state->resowner = CurrentResourceOwner;

+     state->memtupdeleted = 0;
      state->memtupcount = 0;
      state->memtupsize = 1024;    /* initial guess */
      state->memtuples = (void **) palloc(state->memtupsize * sizeof(void *));
*************** tuplestore_clear(Tuplestorestate *state)
*** 401,407 ****
      state->myfile = NULL;
      if (state->memtuples)
      {
!         for (i = 0; i < state->memtupcount; i++)
          {
              FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
              pfree(state->memtuples[i]);
--- 409,415 ----
      state->myfile = NULL;
      if (state->memtuples)
      {
!         for (i = state->memtupdeleted; i < state->memtupcount; i++)
          {
              FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
              pfree(state->memtuples[i]);
*************** tuplestore_clear(Tuplestorestate *state)
*** 409,414 ****
--- 417,423 ----
      }
      state->status = TSS_INMEM;
      state->truncated = false;
+     state->memtupdeleted = 0;
      state->memtupcount = 0;
      readptr = state->readptrs;
      for (i = 0; i < state->readptrcount; readptr++, i++)
*************** tuplestore_end(Tuplestorestate *state)
*** 432,438 ****
          BufFileClose(state->myfile);
      if (state->memtuples)
      {
!         for (i = 0; i < state->memtupcount; i++)
              pfree(state->memtuples[i]);
          pfree(state->memtuples);
      }
--- 441,447 ----
          BufFileClose(state->myfile);
      if (state->memtuples)
      {
!         for (i = state->memtupdeleted; i < state->memtupcount; i++)
              pfree(state->memtuples[i]);
          pfree(state->memtuples);
      }
*************** tuplestore_gettuple(Tuplestorestate *sta
*** 774,787 ****
                  }
                  else
                  {
!                     if (readptr->current <= 0)
                      {
                          Assert(!state->truncated);
                          return NULL;
                      }
                      readptr->current--; /* last returned tuple */
                  }
!                 if (readptr->current <= 0)
                  {
                      Assert(!state->truncated);
                      return NULL;
--- 783,796 ----
                  }
                  else
                  {
!                     if (readptr->current <= state->memtupdeleted)
                      {
                          Assert(!state->truncated);
                          return NULL;
                      }
                      readptr->current--; /* last returned tuple */
                  }
!                 if (readptr->current <= state->memtupdeleted)
                  {
                      Assert(!state->truncated);
                      return NULL;
*************** dumptuples(Tuplestorestate *state)
*** 969,975 ****
  {
      int            i;

!     for (i = 0;; i++)
      {
          TSReadPointer *readptr = state->readptrs;
          int            j;
--- 978,984 ----
  {
      int            i;

!     for (i = state->memtupdeleted;; i++)
      {
          TSReadPointer *readptr = state->readptrs;
          int            j;
*************** dumptuples(Tuplestorestate *state)
*** 984,989 ****
--- 993,999 ----
              break;
          WRITETUP(state, state->memtuples[i]);
      }
+     state->memtupdeleted = 0;
      state->memtupcount = 0;
  }

*************** tuplestore_trim(Tuplestorestate *state)
*** 1153,1176 ****
      nremove = oldest - 1;
      if (nremove <= 0)
          return;                    /* nothing to do */
      Assert(nremove <= state->memtupcount);

      /* Release no-longer-needed tuples */
!     for (i = 0; i < nremove; i++)
      {
          FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
          pfree(state->memtuples[i]);
      }

      /*
!      * Slide the array down and readjust pointers.    This may look pretty
!      * stupid, but we expect that there will usually not be very many
!      * tuple-pointers to move, so this isn't that expensive; and it keeps a
!      * lot of other logic simple.
       *
!      * In fact, in the current usage for merge joins, it's demonstrable that
!      * there will always be exactly one non-removed tuple; so optimize that
!      * case.
       */
      if (nremove + 1 == state->memtupcount)
          state->memtuples[0] = state->memtuples[nremove];
--- 1163,1198 ----
      nremove = oldest - 1;
      if (nremove <= 0)
          return;                    /* nothing to do */
+
+     Assert(nremove >= state->memtupdeleted);
      Assert(nremove <= state->memtupcount);

      /* Release no-longer-needed tuples */
!     for (i = state->memtupdeleted; i < nremove; i++)
      {
          FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
          pfree(state->memtuples[i]);
+         state->memtuples[i] = NULL;
      }
+     state->memtupdeleted = nremove;
+
+     /* mark tuplestore as truncated (used for Assert crosschecks only) */
+     state->truncated = true;

      /*
!      * If nremove is less than 1/8th memtupcount, just stop here, leaving the
!      * "deleted" slots as NULL.  This prevents us from expending O(N^2) time
!      * repeatedly memmove-ing a large pointer array.  The worst case space
!      * wastage is pretty small, since it's just pointers and not whole tuples.
!      */
!     if (nremove < state->memtupcount / 8)
!         return;
!
!     /*
!      * Slide the array down and readjust pointers.
       *
!      * In mergejoin's current usage, it's demonstrable that there will always
!      * be exactly one non-removed tuple; so optimize that case.
       */
      if (nremove + 1 == state->memtupcount)
          state->memtuples[0] = state->memtuples[nremove];
*************** tuplestore_trim(Tuplestorestate *state)
*** 1178,1192 ****
          memmove(state->memtuples, state->memtuples + nremove,
                  (state->memtupcount - nremove) * sizeof(void *));

      state->memtupcount -= nremove;
      for (i = 0; i < state->readptrcount; i++)
      {
          if (!state->readptrs[i].eof_reached)
              state->readptrs[i].current -= nremove;
      }
-
-     /* mark tuplestore as truncated (used for Assert crosschecks only) */
-     state->truncated = true;
  }

  /*
--- 1200,1212 ----
          memmove(state->memtuples, state->memtuples + nremove,
                  (state->memtupcount - nremove) * sizeof(void *));

+     state->memtupdeleted = 0;
      state->memtupcount -= nremove;
      for (i = 0; i < state->readptrcount; i++)
      {
          if (!state->readptrs[i].eof_reached)
              state->readptrs[i].current -= nremove;
      }
  }

  /*

Re: Why percent_rank is so slower than rank?

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not sure whether or not to back-patch this into 9.0 and 8.4. 
> The code in tuplestore.c hasn't changed at all since 8.4, so
> there's not much risk of cross-version bugs, but if I did miss
> anything we could be shipping a buggy version next week. 
> Thoughts?

Is there a performance regression involved, or is it a new feature
which hasn't performed well on this type of query until your patch?
If the latter, I'd be inclined to give it some time on HEAD and
release it in the *following* minor release unless you're *very*
confident it couldn't break anything.

It's an uphill battle to convince managers that it's safe to apply
minor upgrades with minimal testing.  It doesn't take too many slips
for the boulder to roll all the way back to the bottom of that hill.

-Kevin


Re: Why percent_rank is so slower than rank?

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not sure whether or not to back-patch this into 9.0 and 8.4. 
>> The code in tuplestore.c hasn't changed at all since 8.4, so
>> there's not much risk of cross-version bugs, but if I did miss
>> anything we could be shipping a buggy version next week. 
>> Thoughts?
> Is there a performance regression involved, or is it a new feature
> which hasn't performed well on this type of query until your patch? 

Well, since window functions didn't exist before 8.4, it's difficult to
argue that there was a regression.  It's certainly a performance bug
though: nobody would expect that giving a query *more* work_mem would
cause it to run many times slower.

> If the latter, I'd be inclined to give it some time on HEAD and
> release it in the *following* minor release unless you're *very*
> confident it couldn't break anything.

Well, I'm reasonably confident in the patch, and it does pass regression
tests.  But I've been wrong before.

I'm not terribly thrilled with that suggestion though.  Do you have
reason to think that anybody is likely to exercise window functions in
HEAD, beyond what the regression tests do, in the next couple of months?
Slipping the application of the patch to back branches by a little bit
doesn't make a lot of management sense to me.  I think either we trust
it and put it into back branches, or we don't trust it and put it only
in HEAD, so it goes through a beta cycle before hitting production.
        regards, tom lane


Re: Why percent_rank is so slower than rank?

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do you have reason to think that anybody is likely to exercise
> window functions in HEAD, beyond what the regression tests do, in
> the next couple of months?

Not specifically, no.  From the description (not having read the
patch) I was somewhat concerned that it might affect something
outside that narrow use case.  If that's not possible, then I'm more
comfortable putting it in a release that soon after it hits the
repository.

It's a judgment call, and you're clearly in the best position to
make it.  You asked for thoughts, so I shared my concerns.  :-)

-Kevin


Re: Why percent_rank is so slower than rank?

From
Kenneth Marshall
Date:
On Thu, Dec 09, 2010 at 05:18:57PM -0500, Tom Lane wrote:
> I wrote:
> > We're throwing away one tuple at a time as we advance forward through
> > the tuplestore, and moving 100000+ tuple pointers each time.  Ugh.
> > This code was all right when written, because (IIRC) the mergejoin
> > case was actually the only caller.  But it's not all right for
> > WindowAgg's less-predictable usage patterns.
> 
> > I thought for a bit about changing things around so that the first-used
> > tuple slot isn't necessarily state->memtuples[0], but just like the
> > comment says, that complicates a lot of other logic.  And there isn't
> > any easy place to reclaim the wasted slots later.
> 
> > What seems like the best bet is to put in a heuristic to make
> > tuplestore_trim simply not do anything until nremove reaches some
> > reasonably large amount, perhaps 10% of the number of stored tuples.
> > This wastes up to 10% of the allotted memory, but that seems tolerable.
> 
> On reflection I think just not doing anything isn't a very good idea.
> The problem with that is that a mis-coded caller could try to fetch
> tuples that it had already told the tuplestore could be trimmed away;
> and this would work, most of the time, until you got unlucky and the
> trim operation had actually deleted them.  I think it's pretty important
> for bug-catching purposes that the tuplestore enforce that those tuples
> are not available anymore.
> 
> Hence the attached patch, which combines the two ideas by recycling
> tuples immediately but not sliding the pointer array until a reasonable
> amount of movement has occurred.  This fixes the complained-of
> performance problem AFAICT.
> 
> I'm not sure whether or not to back-patch this into 9.0 and 8.4.  The
> code in tuplestore.c hasn't changed at all since 8.4, so there's not
> much risk of cross-version bugs, but if I did miss anything we could
> be shipping a buggy version next week.  Thoughts?
> 
>             regards, tom lane
> 

+1 for back patching.

Ken



Re: Why percent_rank is so slower than rank?

From
Hitoshi Harada
Date:
2010/12/10 Tom Lane <tgl@sss.pgh.pa.us>:
> I wrote:
>> We're throwing away one tuple at a time as we advance forward through
>> the tuplestore, and moving 100000+ tuple pointers each time.  Ugh.
>> This code was all right when written, because (IIRC) the mergejoin
>> case was actually the only caller.  But it's not all right for
>> WindowAgg's less-predictable usage patterns.
>
>> I thought for a bit about changing things around so that the first-used
>> tuple slot isn't necessarily state->memtuples[0], but just like the
>> comment says, that complicates a lot of other logic.  And there isn't
>> any easy place to reclaim the wasted slots later.
>
>> What seems like the best bet is to put in a heuristic to make
>> tuplestore_trim simply not do anything until nremove reaches some
>> reasonably large amount, perhaps 10% of the number of stored tuples.
>> This wastes up to 10% of the allotted memory, but that seems tolerable.
>
> On reflection I think just not doing anything isn't a very good idea.
> The problem with that is that a mis-coded caller could try to fetch
> tuples that it had already told the tuplestore could be trimmed away;
> and this would work, most of the time, until you got unlucky and the
> trim operation had actually deleted them.  I think it's pretty important
> for bug-catching purposes that the tuplestore enforce that those tuples
> are not available anymore.

I see it's too late now that you've committed it, but it seems there
was another way to avoid it by not trimming from percent_rank()
individually. Once the whole partition fits in memory, you don't
need to trim it since it never grows. The trimming logic is for
something like moving aggregates and (simple) rank(), which grows
tuplestore content as it advances. percent_rank() doesn't seem to
match the optimization.

Regards,

--
Hitoshi Harada


Re: Why percent_rank is so slower than rank?

From
Tom Lane
Date:
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> I see it's too late now that you've committed it,

Patches can always be reverted...

> but it seems there
> was another way to avoid it by not trimming from percent_rank()
> individually. Once the whole partition fits in memory, you don't
> need to trim it since it never grows. The trimming logic is for
> something like moving aggregates and (simple) rank(), which grows
> tuplestore content as it advances. percent_rank() doesn't seem to
> match the optimization.

I don't think this idea leads to a robust solution.  When you have a
combination of different window functions being used in the same scan,
you can't expect any one of them to know the global situation.  Having
percent_rank lie about its requirements in order to avoid bad behavior
in the tuplestore infrastructure is just going to create more problems
down the road.  We need to have the individual functions tell the truth
and then do any optimization hacking in the WindowAgg code or
infrastructure.
        regards, tom lane


Re: Why percent_rank is so slower than rank?

From
Hitoshi Harada
Date:
2010/12/11 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> I see it's too late now that you've committed it,
>
> Patches can always be reverted...
>
>> but it seems there
>> was another way to avoid it by not trimming from percent_rank()
>> individually. Once the whole partition fits in memory, you don't
>> need to trim it since it never grows. The trimming logic is for
>> something like moving aggregates and (simple) rank(), which grows
>> tuplestore content as it advances. percent_rank() doesn't seem to
>> match the optimization.
>
> I don't think this idea leads to a robust solution.  When you have a
> combination of different window functions being used in the same scan,
> you can't expect any one of them to know the global situation.  Having
> percent_rank lie about its requirements in order to avoid bad behavior
> in the tuplestore infrastructure is just going to create more problems
> down the road.  We need to have the individual functions tell the truth
> and then do any optimization hacking in the WindowAgg code or
> infrastructure.

Hm? Once percent_rank() scans to the partition end, any other window
functions that scan row by row don't need to care about the memory
reduction, do they? Or more generally, if the partition has been
scanned to the end, we don't need to trim the tuplestore anymore. Am I
misunderstanding?

Regards,

--
Hitoshi Harada


Re: Why percent_rank is so slower than rank?

From
Tom Lane
Date:
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> Hm? Once percent_rank() scans to the partition end, any other window
> functions that scan row by row don't need to care about the memory
> reduction, do they? Or more generally, if the partition has been
> scanned to the end, we don't need to trim the tuplestore anymore. Am I
> misunderstanding?

Giving back the memory as we do the scan is still a good thing IMO;
there might be other uses for it.  In any case I don't see where you're
going to put such a heuristic without breaking potentially interesting
uses elsewhere.  The tuplestore doesn't know anything about partitions
being read to the end; and WindowAgg doesn't (or shouldn't) know about
whether the tuplestore is all in memory.

Furthermore, the performance problem would exist for any situation where
the window functions had read far beyond the frame start, whether that
was all the way to partition end or not.  Consider a frame like ROWS
BETWEEN 10000 PRECEDING AND 10000 FOLLOWING.

In the end this is a local problem inside tuplestore, and kluging its
callers to work around it is the wrong approach.
        regards, tom lane


Re: Why percent_rank is so slower than rank?

From
Ron Mayer
Date:
Tom Lane wrote:
> argue that there was a regression.  It's certainly a performance bug
> though: nobody would expect that giving a query *more* work_mem would
> cause it to run many times slower.

I wouldn't be that surprised - otherwise it'd just be hard-coded to
something large.

Especially since earlier in the thread:
> The example is *not* particularly slow if you leave work_mem at default.
which makes me think it's arguably not quite a bug.