Thread: Can't get two index scans

Can't get two index scans

From
Craig James
Date:
I'm working with a third-party plugin that does chemistry. It's very fast. However, I'm trying to do a sampling query, such as the first 1% of the database, and I just can't get the planner to create a good plan.  Here is the full query (the |>| operator does a subgraph match of a molecular substructure, in this case benzene, to find all molecules that have a benzene ring in the database):

explain analyze select * from version where smiles |>| 'c1ccccc1';
 ...
 Index Scan using i_version_smiles on version  (cost=3445.75..147094.03 rows=180283 width=36) (actual time=336.493..10015.753
 rows=180973 loops=1)
   Index Cond: (smiles |>| 'c1ccccc1'::molecule)
 Planning time: 1.228 ms
 Execution time: 10371.903 ms

Ten seconds over 263,000 molecules, which is actually good. Now let's limit it to the first 1% of the rows:

explain analyze select * from version where smiles |>| 'c1ccccc1' and version_id < 897630;
...
 Index Scan using pk_version on version  (cost=0.42..131940.05 rows=1643 width=36) (actual time=6.122..2816.298 rows=2039 loops=1)
   Index Cond: (version_id < 897630)
   Filter: (smiles |>| 'c1ccccc1'::molecule)
   Rows Removed by Filter: 590
 Planning time: 1.217 ms
 Execution time: 2822.117 ms

Notice that it doesn't use the i_version_smiles index at all, but instead applies the very expensive filter |>| to all 1% of the database. So instead of getting a 100x speedup, we only get a 3x speedup, about 30x worse that what is theoretically possible.

The production database is about 50x larger than this test database.

Maybe I misunderstand what's possible with indexes, but it seems to me that it could first do the pk_version index scan, and then use the results of that to do a limited index-scan search using the i_version_smiles index. Is that not possible? Is each index scan "self contained", that is, it doesn't take into account the results of another index scan?

Thanks,
Craig

Re: Can't get two index scans

From
Jeff Janes
Date:
On Wed, Jun 22, 2016 at 9:03 AM, Craig James <cjames@emolecules.com> wrote:
> I'm working with a third-party plugin that does chemistry.


Out of personal/professional curiosity, which one are you using, if
that can be disclosed?

....


> Notice that it doesn't use the i_version_smiles index at all, but instead
> applies the very expensive filter |>| to all 1% of the database.

You have to tell the database that |>| is very expensive, by setting
the COST of the function which it invokes.  You can get the name of
the function with:

select oprcode from pg_operator where oprname ='|>|' ;

(taking care for schema and overloading, etc.)

I would set the COST to at least 1000, probably more.

> So instead
> of getting a 100x speedup, we only get a 3x speedup, about 30x worse that
> what is theoretically possible.
>
> The production database is about 50x larger than this test database.
>
> Maybe I misunderstand what's possible with indexes, but it seems to me that
> it could first do the pk_version index scan, and then use the results of
> that to do a limited index-scan search using the i_version_smiles index. Is
> that not possible?

I don't think it can do that.  What it can do is run each index scan
to completion as a bitmap index scan, and then AND the bitmaps
together.

You might be able to build a multiple column index on (smiles,
version_id) and have it do the right thing automatically. Whether that
is possible, and if so how effective it will actually be, would depend
on the implementation details of |>|. My gut feeling is that it would
not work well.

You could partition your data on version_id.  Then it would keep a
separate smiles index on each partition, and would only consult those
indexes which can possibly contain (according to the CHECK
constraints) the version_ids of interest in the query.

Also, if you tune your system using benzene, you will be probably
arrive at a place not optimal for more realistic queries.

Cheers,

Jeff


Re: Can't get two index scans

From
Craig James
Date:
On Wed, Jun 22, 2016 at 11:36 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jun 22, 2016 at 9:03 AM, Craig James <cjames@emolecules.com> wrote:
> I'm working with a third-party plugin that does chemistry.


Out of personal/professional curiosity, which one are you using, if
that can be disclosed?

ChemAxon (JChem)
 
> Notice that it doesn't use the i_version_smiles index at all, but instead
> applies the very expensive filter |>| to all 1% of the database.

You have to tell the database that |>| is very expensive, by setting
the COST of the function which it invokes.  You can get the name of
the function with:

select oprcode from pg_operator where oprname ='|>|' ;

(taking care for schema and overloading, etc.)

I would set the COST to at least 1000, probably more.

I'll try this. I've done it with my own functions, but didn't realize you could do it with existing operators.


> So instead
> of getting a 100x speedup, we only get a 3x speedup, about 30x worse that
> what is theoretically possible.
>
> The production database is about 50x larger than this test database.
>
> Maybe I misunderstand what's possible with indexes, but it seems to me that
> it could first do the pk_version index scan, and then use the results of
> that to do a limited index-scan search using the i_version_smiles index. Is
> that not possible?

I don't think it can do that.  What it can do is run each index scan
to completion as a bitmap index scan, and then AND the bitmaps
together.

That won't help in this case because the index scan of the molecule table can be slow.
 

You might be able to build a multiple column index on (smiles,
version_id) and have it do the right thing automatically. Whether that
is possible, and if so how effective it will actually be, would depend
on the implementation details of |>|. My gut feeling is that it would
not work well.

No, because it's not a normal exact-match query. The analogy would be that you can build a multi-column index for an '=' operation on a string, but it wouldn't help if you were doing an '~' or 'LIKE' operation.
 
You could partition your data on version_id.  Then it would keep a
separate smiles index on each partition, and would only consult those
indexes which can possibly contain (according to the CHECK
constraints) the version_ids of interest in the query.

I actually struck on this solution today and it works well. Instead partitioning on the version_id, I added a column "p" ("partition") and used 20 partitions where p is a random number from 0..19. This has the advantage that as new compounds are added, they are distributed throughout the partitions, so each partition remains a 5% sample of the whole.

It's pretty cool. A full-table scan of all partitions is slightly slower, but if I want to do a sample and limit the run time, I can query with p = 0.

It also has another huge benefit for a web site: I can give the user a progress-bar widget by querying the partitions one-by-one and updating the progress in 5% increments. This is really critical for long-running queries. 


Also, if you tune your system using benzene, you will be probably
arrive at a place not optimal for more realistic queries.

No, it's actually very useful. I'm not interested in optimizing typical queries, but rather in limiting worst-case queries. This is a public web site, and you never know what molecule someone will draw. In fact, it's quite common for visitors to draw silly molecules like benzine or methane that would result in a heavy load if left to run to completion.

Thanks for your help!
Craig


Cheers,

Jeff



--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

Re: Can't get two index scans

From
Jeff Janes
Date:
On Wed, Jun 22, 2016 at 9:36 PM, Craig James <cjames@emolecules.com> wrote:
> On Wed, Jun 22, 2016 at 11:36 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

>> You might be able to build a multiple column index on (smiles,
>> version_id) and have it do the right thing automatically. Whether that
>> is possible, and if so how effective it will actually be, would depend
>> on the implementation details of |>|. My gut feeling is that it would
>> not work well.
>
>
> No, because it's not a normal exact-match query. The analogy would be that
> you can build a multi-column index for an '=' operation on a string, but it
> wouldn't help if you were doing an '~' or 'LIKE' operation.

That restriction only applies to BTREE indexes.  GiST and GIN indexes
work differently, and don't have that particular limitation.  They can
use the second column of the index even if the first column is not
used, or (in the case of GiST at least) the first column is used with
an operator other than equality.

The main problems I've run into with GiST indexes is that they
sometimes take absurdly long times to build; and that the
split-picking algorithm might arrive at buckets ill-suited to your
queries so that the consultation of the index "works" in the sense
that it discards most of the non-matching rows without inspecting
them, but isn't actually faster. Unfortunately, both of these problems
seem hard to predict.  You pretty much have to try it (on a full-size
data set, as scaling up from toy data sets is also hard to predict)
and see how it does.

But, JChem's cartridge is apparently not using a GiST index, which is
what my first guess was.  I can't really figure out what PostgreSQL
API it is tapping into, so whatever it is very well might not support
multi-column indexes at all.

>> You could partition your data on version_id.  Then it would keep a
>> separate smiles index on each partition, and would only consult those
>> indexes which can possibly contain (according to the CHECK
>> constraints) the version_ids of interest in the query.
>
>
> I actually struck on this solution today and it works well. Instead
> partitioning on the version_id, I added a column "p" ("partition") and used
> 20 partitions where p is a random number from 0..19. This has the advantage
> that as new compounds are added, they are distributed throughout the
> partitions, so each partition remains a 5% sample of the whole.
>
> It's pretty cool. A full-table scan of all partitions is slightly slower,
> but if I want to do a sample and limit the run time, I can query with p = 0.
>
> It also has another huge benefit for a web site: I can give the user a
> progress-bar widget by querying the partitions one-by-one and updating the
> progress in 5% increments. This is really critical for long-running queries.

That does sound pretty useful.  You could potentially get the same
benefit with the multicolumn GiST index, without needing to partition
the table.  In a vague hand-wavy way, building an index "USING GIST
(p, smiles jchem_op_class)" is like using p to automatically partition
the index so it acts like individual indexes over smiles for each
value of p.  But it is unlikely to ever be as efficient as
well-crafted explicit partitions, and once you have gone to the effort
of setting them up there would probably be no point in trying to
change over.


>> Also, if you tune your system using benzene, you will be probably
>> arrive at a place not optimal for more realistic queries.
>
>
> No, it's actually very useful. I'm not interested in optimizing typical
> queries, but rather in limiting worst-case queries. This is a public web
> site, and you never know what molecule someone will draw. In fact, it's
> quite common for visitors to draw silly molecules like benzine or methane
> that would result in a heavy load if left to run to completion.

My benefit in having a non-public web site, is that I can just walk
over to their desk and yell at the people who do things like that to
my database.

(And I promise to stop searching for methane on your web site.)

Cheers,

Jeff


Re: Can't get two index scans

From
Craig James
Date:
On Thu, Jun 23, 2016 at 8:47 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jun 22, 2016 at 9:36 PM, Craig James <cjames@emolecules.com> wrote:
> On Wed, Jun 22, 2016 at 11:36 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
...
But, JChem's cartridge is apparently not using a GiST index, which is
what my first guess was.  I can't really figure out what PostgreSQL
API it is tapping into, so whatever it is very well might not support
multi-column indexes at all.

They run a separate chemistry server process, which I believe is a Java app. My guess (and it's strictly a guess) is that they use classic chemical bitmap fingerprints, and the index scan runs in their separate process and only returns the index-scan results.

Craig