Re: Improvement for query planner? (no, not about count(*) again ;-)) - Mailing list pgsql-bugs

From Francisco Olarte
Subject Re: Improvement for query planner? (no, not about count(*) again ;-))
Date
Msg-id CA+bJJbxEuHXV4oiVrB54y3-tBgd+cwmMCFUHTFH2VB8BcfTFdA@mail.gmail.com
Whole thread Raw
In response to Improvement for query planner? (no, not about count(*) again ;-))  (Tobias Völk <tobias.voelk@t-online.de>)
List pgsql-bugs
Tobias:

On Mon, Jul 20, 2020 at 12:09 PM Tobias Völk <tobias.voelk@t-online.de> wrote:
...
> Insert into newtable(name) select name1 from games on conflict do nothing;
> (and later on intended to do the same for the second column)
> However after hours it still wasn’t done, used only 1 cpu core to the max and read with 5 MB/s from my fast SSD.
> I’ve also tried inserting (select name1 from games union select name2 from games) but it always wanted to do it using
sorting.
> But either the sorting or the preperations for the sorting were again only done using 1 core to the max and reading
with5 MB/s. 
> Couldn’t find a fast query for my problem.
> So I wrote a java-program which read the whole table at a fetchsize of about 4 million and inserted the names into a
HashSet.
> And surprisingly after only a few minutes the program was already 25% done o.O

Not surprising, 1.3E9 rows, lets say 400M for 25%, SSD, if your net is
fast enough pg should be able to send you a million rows a second to
do that.

Have you tried doing a similar thing in postgres, like "select
distinct name1", or select distinct name1 union select distinct
name2". The distinct part is the equivalent of putting everything in a
hash set.

The part of doing it using sorting, maybe you do not have enough
work_mem or other things, but it is probably the right way to do it
under the constraints you have put on the engine, but I would not
bother with that before timing. I routinely sort huge files ( not in
pg ) spilling to disk ( they are about a hundred times available RAM )
sort a few gigabytes, spill, read and merge in multi megabytes chunks,
and it only makes a constant factor (2, IIRC, ram sort is
read+sort+write, spill is read+write chunks + read chunks + write   )
when testing against pure ram ( with a file which fit in ram ).

> My question is, why isn’t postgres nearly this fast? Why doesn’t it just create a HashSet in RAM and read full speed
fromthe disk? 

The on conflict version may be slow because it is not optimized for
this kind of things, and is doing nothing but testing for conflict on
every row ( which may be needed ). Also, you have set a primary key
before a bulk load, which is a big no-no as pg has to build the index
as it loads.

I would try to do the equivalent of the hash set, create the table
without the PK, then try something like "select distint name1 union
select distinct name2", which is similar to build two hashsets and
collapse them, then add the primary key afterwards. Test it in steps.

> I even created a hash index but it kept using it’s primary key b-tree and then I read that hash indices somehow don’t
supportchecking for uniqueness. 

Also more indexes => slower loading.

Francisco Olarte.



pgsql-bugs by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Buffers from parallel workers not accumulated to upper nodes with gather merge
Next
From: Ruslan Mukhamedov
Date:
Subject: apt-get -yq purge postgresql-common shows interactive dialog