Home > mailing lists

Re: Getting an index scan to be a parallel index scan - Mailing list pgsql-performance

From	Thomas Munro
Subject	Re: Getting an index scan to be a parallel index scan
Date	February 1, 2023 21:51:38
Msg-id	CA+hUKGKFTpFc3_zi04iJ03nb9kuim4bUvgw-4s85j=FZ9REJRw@mail.gmail.com Whole thread
In response to	Getting an index scan to be a parallel index scan (Alex Kaiser <alextkaiser@gmail.com>)
Responses	Re: Getting an index scan to be a parallel index scan
List	pgsql-performance

Tree view

On Wed, Feb 1, 2023 at 6:39 PM Alex Kaiser <alextkaiser@gmail.com> wrote:
> select * from testing where id in (1608377,5449811, ... <1000 random ids> ,4654284,3558460);
>
> Essentially I have a list of 1000 ids and I would like the rows for all of those ids.
>
> This seems like it would be pretty easy to parallelize, if you have X threads then you would split the list of IDs
into1000/X sub lists and give one to each thread to go find the rows for ids in the given list.  Even when I use the
followingconfigs I don't get a query plan that actually uses any parallelism: 

It sounds like the plan you are imagining is something like:

Gather
  Nested Loop Join
    Outer side: <partial scan of your set of constant values>
    Inner side: Index scan of your big table

Such a plan would only give the right answer if each process has a
non-overlapping subset of the constant values to probe the index with,
and together they have the whole set.  Hypothetically, a planner could
chop that set up beforehand and and give a different subset to each
process (just as you could do that yourself using N connections and
separate queries), but that might be unfair: one process might find
lots of matches, and the others might find none, because of the
distribution of data.  So you'd ideally want some kind of "work
stealing" scheme, where each worker can take more values to probe from
whenever it needs more, so that they all keep working until the values
run out.  We don't have a thing that can do that.  You might imagine
that a CTE could do it, so WITH keys_to_look_up AS (VALUES (1), (2),
...) SELECT ... JOIN ON ..., but that also doesn't work because we
don't have a way to do "partial" scans of CTEs either (though someone
could invent that).  Likewise for temporary tables: they are invisible
to parallel workers, so they can't help us.  I have contemplated
"partial function scans" for set-returning functions, where a function
could be given a bit of shared memory and various other infrastructure
to be able to be "parallel aware" (= able to coordinate across
processes so that each process gets a subset of the data), and one
could imagine that that would allow various solutions to the problem,
but that's vapourware.

But you can get a plan like that if you insert all those values into a
regular table, depending on various settings, stats and
min_parallel_table_scan_size (try 0, I guess that'll definitely do
it).  Which probably isn't the answer you wanted to hear.

pgsql-performance by date:

From: Justin Pryzby
Date: 01 February 2023, 20:02:39
Subject: Re: Getting an index scan to be a parallel index scan

From: Alex Kaiser
Date: 02 February 2023, 00:54:17
Subject: Re: Getting an index scan to be a parallel index scan

Re: Getting an index scan to be a parallel index scan - Mailing list pgsql-performance

Previous

Next