Thread: Consolidate 'unique array values' logic into a reusable function?

Consolidate 'unique array values' logic into a reusable function?

From
Thomas Munro
Date:
Hi,

Looking at commits f10eab73d and c50d192c, I wondered why we don't
have a reusable in-place unique function.  It may be trivial, but we
seem to have a lot of copies and variations in the tree.

Here's a sketch patch that creates a function array_unique which takes
the same arguments as qsort or qsort_arg and returns the new length.
The patch replaces all the specialised unique functions and open coded
versions that I could find with simple greps, but there are probably
more.

My compiler seems to inline the comparator function and memcpy well,
so I can't measure any speed difference between array_unique(array,
size, sizeof(int), compare_int) and a hand-crafted loop using == for
comparison and = for assignment, for a billion items.

If no one objects I'll post a version of this to a commitfest, along
with some other trivial code duplication refactoring work I posted a
while back that consolidates popcount and ffs/fls implementations.  I
don't like code duplication :-)

--
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Consolidate 'unique array values' logic into a reusable function?

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Here's a sketch patch that creates a function array_unique which takes
> the same arguments as qsort or qsort_arg and returns the new length.

Hmm ... I'd be against using this in backend/regex/, because I still
have hopes of converting that to a standalone library someday (and
in any case it needs to stay compatible with Tcl's copy of the code).
But otherwise this seems like a reasonable proposal.

As for the function name, maybe "qunique()" to go with "qsort()"?
I'm not thrilled with "array_unique" because that sounds like it
is meant for Postgres' array data types.
        regards, tom lane



Re: Consolidate 'unique array values' logic into a reusable function?

From
Thomas Munro
Date:
Hello,

I'm reviving a thread from 2016, because I wanted this thing again today.

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > Here's a sketch patch that creates a function array_unique which takes
> > the same arguments as qsort or qsort_arg and returns the new length.
>
> Hmm ... I'd be against using this in backend/regex/, because I still
> have hopes of converting that to a standalone library someday (and
> in any case it needs to stay compatible with Tcl's copy of the code).
> But otherwise this seems like a reasonable proposal.
>
> As for the function name, maybe "qunique()" to go with "qsort()"?
> I'm not thrilled with "array_unique" because that sounds like it
> is meant for Postgres' array data types.

OK, here it is renamed to qunique() and qunique_arg().  It's a bit odd
because it has nothing to do with the quicksort algorithm, but make
some sense because it's always used with qsort().  I suppose we could
save a few more lines if there were a qsort_unique() function that
does both, since the arguments are identical.  I also moved it into a
new header lib/qunique.h.  Any better ideas for where it should live?
I removed the hunk under regex.

One thing I checked is that on my system it is inlined along with the
comparator when that is visible, so no performance should be lost by
throwing away the open coded versions.  This makes me think that eg
oid_cmp() should probably be defined in a header; clearly we're also
carrying a few functions that should be consolidated into a new
int32_cmp() function, somewhere, too.  (It might also be interesting
to use the pg_attribute_always_inline trick to instantiate some common
qsort() specialisations for a bit of speed-up, but that's another
topic.)

Adding to CF.

--
Thomas Munro
https://enterprisedb.com

Attachment

Re: Consolidate 'unique array values' logic into a reusable function?

From
Thomas Munro
Date:
On Fri, Aug 30, 2019 at 3:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Adding to CF.

Rebased due to bitrot.  Spotted one more place to use this, in
src/backend/utils/adt/txid.c.

-- 
Thomas Munro
https://enterprisedb.com

Attachment

Re: Consolidate 'unique array values' logic into a reusable function?

From
Thomas Munro
Date:
On Tue, Sep 10, 2019 at 11:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Rebased due to bitrot.  Spotted one more place to use this, in
> src/backend/utils/adt/txid.c.

Rebased.  I'm planning to commit this soon.

Attachment

Re: Consolidate 'unique array values' logic into a reusable function?

From
Noah Misch
Date:
On Mon, Nov 04, 2019 at 12:02:21PM +1300, Thomas Munro wrote:
> Rebased.  I'm planning to commit this soon.

In each installcheck-parallel run under valgrind-3.14.0, I now see ~1200
reports like this:

==00:00:00:28.322 1527557== Source and destination overlap in memcpy(0x1000104, 0x1000104, 4)
==00:00:00:28.322 1527557==    at 0x4C2E74D: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1035)
==00:00:00:28.322 1527557==    by 0xA9A57B: qunique (qunique.h:34)
==00:00:00:28.322 1527557==    by 0xA9A843: InitCatalogCache (syscache.c:1056)
==00:00:00:28.322 1527557==    by 0xAB6B18: InitPostgres (postinit.c:682)
==00:00:00:28.322 1527557==    by 0x91F98E: PostgresMain (postgres.c:3909)
==00:00:00:28.322 1527557==    by 0x872DE9: BackendRun (postmaster.c:4498)
==00:00:00:28.322 1527557==    by 0x8725B3: BackendStartup (postmaster.c:4189)
==00:00:00:28.322 1527557==    by 0x86E7F4: ServerLoop (postmaster.c:1727)
==00:00:00:28.322 1527557==    by 0x86E0AA: PostmasterMain (postmaster.c:1400)
==00:00:00:28.322 1527557==    by 0x77CB56: main (main.c:210)
==00:00:00:28.322 1527557== 
{
   <insert_a_suppression_name_here>
   Memcheck:Overlap
   fun:memcpy@@GLIBC_2.14
   fun:qunique
   fun:InitCatalogCache
   fun:InitPostgres
   fun:PostgresMain
   fun:BackendRun
   fun:BackendStartup
   fun:ServerLoop
   fun:PostmasterMain
   fun:main
}

This is like the problem fixed in 9a9473f; the precedent from there would be
to test src!=dst before calling mempcy(), e.g. as attached.  I suppose the
alternative would be to add a suppression like the one 9a9473f removed.

I do wonder why the Valgrind buildfarm animals haven't noticed.

Attachment

Re: Consolidate 'unique array values' logic into a reusable function?

From
Thomas Munro
Date:
On Sun, Dec 29, 2019 at 8:02 PM Noah Misch <noah@leadboat.com> wrote:
> ==00:00:00:28.322 1527557== Source and destination overlap in memcpy(0x1000104, 0x1000104, 4)
> ==00:00:00:28.322 1527557==    at 0x4C2E74D: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1035)
> ==00:00:00:28.322 1527557==    by 0xA9A57B: qunique (qunique.h:34)
> ==00:00:00:28.322 1527557==    by 0xA9A843: InitCatalogCache (syscache.c:1056)
> ==00:00:00:28.322 1527557==    by 0xAB6B18: InitPostgres (postinit.c:682)
> ==00:00:00:28.322 1527557==    by 0x91F98E: PostgresMain (postgres.c:3909)
> ==00:00:00:28.322 1527557==    by 0x872DE9: BackendRun (postmaster.c:4498)
> ==00:00:00:28.322 1527557==    by 0x8725B3: BackendStartup (postmaster.c:4189)
> ==00:00:00:28.322 1527557==    by 0x86E7F4: ServerLoop (postmaster.c:1727)
> ==00:00:00:28.322 1527557==    by 0x86E0AA: PostmasterMain (postmaster.c:1400)
> ==00:00:00:28.322 1527557==    by 0x77CB56: main (main.c:210)
> ==00:00:00:28.322 1527557==
> {
>    <insert_a_suppression_name_here>
>    Memcheck:Overlap
>    fun:memcpy@@GLIBC_2.14
>    fun:qunique
>    fun:InitCatalogCache
>    fun:InitPostgres
>    fun:PostgresMain
>    fun:BackendRun
>    fun:BackendStartup
>    fun:ServerLoop
>    fun:PostmasterMain
>    fun:main
> }
>
> This is like the problem fixed in 9a9473f; the precedent from there would be
> to test src!=dst before calling mempcy(), e.g. as attached.  I suppose the
> alternative would be to add a suppression like the one 9a9473f removed.

Thanks for fixing that.

> I do wonder why the Valgrind buildfarm animals haven't noticed.

Optimisation levels?  For example, I see that skink is using -Og, at
which level my local GCC inlines qunique() and memcpy() so that in the
case you quoted there's a MOV instruction and valgrind has nothing to
complain about.



Re: Consolidate 'unique array values' logic into a reusable function?

From
Noah Misch
Date:
On Wed, Jan 08, 2020 at 02:49:48PM +1300, Thomas Munro wrote:
> On Sun, Dec 29, 2019 at 8:02 PM Noah Misch <noah@leadboat.com> wrote:
> > ==00:00:00:28.322 1527557== Source and destination overlap in memcpy(0x1000104, 0x1000104, 4)
> > ==00:00:00:28.322 1527557==    at 0x4C2E74D: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1035)
> > ==00:00:00:28.322 1527557==    by 0xA9A57B: qunique (qunique.h:34)
> > ==00:00:00:28.322 1527557==    by 0xA9A843: InitCatalogCache (syscache.c:1056)
> > ==00:00:00:28.322 1527557==    by 0xAB6B18: InitPostgres (postinit.c:682)
> > ==00:00:00:28.322 1527557==    by 0x91F98E: PostgresMain (postgres.c:3909)
> > ==00:00:00:28.322 1527557==    by 0x872DE9: BackendRun (postmaster.c:4498)
> > ==00:00:00:28.322 1527557==    by 0x8725B3: BackendStartup (postmaster.c:4189)
> > ==00:00:00:28.322 1527557==    by 0x86E7F4: ServerLoop (postmaster.c:1727)
> > ==00:00:00:28.322 1527557==    by 0x86E0AA: PostmasterMain (postmaster.c:1400)
> > ==00:00:00:28.322 1527557==    by 0x77CB56: main (main.c:210)
> > ==00:00:00:28.322 1527557==
> > {
> >    <insert_a_suppression_name_here>
> >    Memcheck:Overlap
> >    fun:memcpy@@GLIBC_2.14
> >    fun:qunique
> >    fun:InitCatalogCache
> >    fun:InitPostgres
> >    fun:PostgresMain
> >    fun:BackendRun
> >    fun:BackendStartup
> >    fun:ServerLoop
> >    fun:PostmasterMain
> >    fun:main
> > }
> >
> > This is like the problem fixed in 9a9473f; the precedent from there would be
> > to test src!=dst before calling mempcy(), e.g. as attached.  I suppose the
> > alternative would be to add a suppression like the one 9a9473f removed.
> 
> Thanks for fixing that.

Glad to help.

> > I do wonder why the Valgrind buildfarm animals haven't noticed.
> 
> Optimisation levels?  For example, I see that skink is using -Og, at
> which level my local GCC inlines qunique() and memcpy() so that in the
> case you quoted there's a MOV instruction and valgrind has nothing to
> complain about.

That explains it.  I would have been using -O0.  (It would be a nice bonus to
have both an -O0 Valgrind buildfarm animal and an optimized Valgrind animal,
since neither catches everything.)