We have a few slow queries that use sequential scans on tables that have
plenty of indexes (for other queries), on a box with a lot of RAM and 13
active cores (don't ask), so I was curious to find out how to put this
environment to better use. The result is (maybe) interesting, esp. since
PostgreSQL is getting better at executing many queries in parallel
lately and we will have more than 16 cores in typical servers very soon.
The simplified scenario was a query like
select * from t where foo ~ 'bla';
on a table with approx. 9m rows, taking around 12 seconds in the best
case. The table had a bigserial primary key "eid" with btree index,
which seemed to be the most suitable starting point.
The current value range of eid was partitioned into intervals of equal
size and depending on the number of rows in these intervals, one or more
of them were assigned to worker processes (this is done once per day,
not before each query!):
worker 1: select * from t where foo ~ 'bla' and (eid >= 300000 and eid <
worker 2: select * from t where foo ~ 'bla' and (eid >= 500000 and eid <
600000 or eid >= 1100000 and eid < 1200000 ...)
Instead of a sequential scan, a bunch of worker processes (implemented
with Gearman/Perl) would then execute queries (one each) with a plan like:
Bitmap Heap Scan on t ...
Recheck Cond: eid >= 300000 and eid < 400000 ..
Filter: foo ~ 'bla'
Bitmap Index Scan on t_pkey ...
This led to a speedup factor of 2 when the box was idle, i.e. the time
to split the query, distribute the jobs to worker processes, execute the
queries in parallel, collect results and send them back to the client
was now ~6 seconds.
- currently, the speedup is almost the same for anything between ~10 and
80+ workers (~12s down to ~7s on average, best run ~ 6s)
- the effective processing speed of the workers varied greatly (fastest
~3x to ~10x the rows/second of the slowest - real time divided by rows
to work on)
- the fastest workers went as fast as the sequential scans (in rows per
second) - sometimes, but not always (most likely they were actually
running alone then for some reason)
- in each new run, the workers finished in a completely different order
(even though they had the same parts of the table to work on and thus
identical queries) so perhaps the partitioning of the work load is quite
good already and it's more of a scheduling issue (Gearman? Shared buffer
- the Linux I/O scheduler had a visible effect, "noop" was better than
"deadline" (others not tried yet) ~10%, but this is typical for random
writes and RAID controllers that manage their writeback cache as they
like (it's a wasted effort to reorder writes before hitting the RAID
- CLUSTER might help a lot (the workers should hit fewer pages and need
fewer shared resources?) but I haven't tested it
- our query performance is not limited by disk I/O (as is usually the
case I guess), since we have most of the tables/indexes in RAM. Whether
it scales as well (or better?) with a proper disk subsystem and less
RAM, is unknown.
I hope there is some room for improvement so these queries can execute
faster in parallel for better scaling, these first results are quite
encouraging. I'd love to put 32+ cores to use for single queries.
Perhaps something like this could be built into PostgreSQL at some
point? There's no complicated multithreading/locking involved and
Postgres has enough statistics available to distribute work even better.
It should be easy to implement this for any of the various connection
pooling solutions also.
Has anyone done similar work in the light of upcoming many-core
CPUs/systems? Any better results than 2x improvement?
Apologies if this is a well-known and widely used technique already. ;-)
PS. yes, for the example query we could use tsearch2 etc., but it has
drawbacks in our specific case (indexing overhead, no real regexps
possible) and it's only an example anyway ...