Thread: how to optimize my c-extension functions

how to optimize my c-extension functions

From
TJ O'Donnell
Date:
I've written a c-language extension to postgresql to implement a
chemical search of a varchar column (named smiles, typically).
It might be called as:
  oe_matches(smiles,'COCC') where 'COCC' is a typical search string.
This uses 3rd party functions to parse the varchar inputs into c++
objects.  In order to optimize this search, I'd like to parse the whole
table containing smiles just once, store it and use the parsed data
instead of the varchar smiles.

So, I could create another column, say smiles_ob and have the
oe_matches function take that column.  I'd have to be sure the smiles_ob
column was related (by a trigger?) to the smiles column.
But I was thinking I might be able to hide these parsed objects from the
user by somehow incoporating the parsed objects into a type of index.
I'd like also to use additional columns (like molecular formula) in
the match function to "triage" the table to quickly rule out impossible
matches, before doing a full search/match.

Am I way off the track here?  Is it a bad idea to clutter the index
with things like this?  Is it possible?   Is there another
approach that could hide some of these details from the user - meaning
they would not have to create/update these additional columns?

Thanks,
TJ O'Donnell

Re: how to optimize my c-extension functions

From
Pierre-Frédéric Caillaud
Date:
    I gather your program uses two steps, let's call them :
    - parse( smiles ) -> data
    - search( data ) -> result

    You can create a functional index on your smiles column, but I don't know
if this will help you ; you can do things like CREATE INDEX ... ON
mytable( lower( myfield )), then SELECT ... FROM mytable WHERE
lower(myfield) = something, BUT in your case I gather your search function
which processes the parsed data does a lot more than just dome character
match, so creating a functional index on parse(smile) would be useless for
selecting on search(parse(smile))...

    So, in any case, if the parsing phase is slow, you can store the
preparsed data in a text or binary field and search it directly, but this
will not be indexed.

    If you can map a subset of your searchable properties to simple
datatypes, you could do a first search for related matches, as you said.

    You say nothing about how your system works internally, whta kind of
representation is it and what kind of searches do you actually do ?


On Sat, 08 Jan 2005 15:50:06 -0800, TJ O'Donnell <tjo@acm.org> wrote:

> I've written a c-language extension to postgresql to implement a
> chemical search of a varchar column (named smiles, typically).
> It might be called as:
>   oe_matches(smiles,'COCC') where 'COCC' is a typical search string.
> This uses 3rd party functions to parse the varchar inputs into c++
> objects.  In order to optimize this search, I'd like to parse the whole
> table containing smiles just once, store it and use the parsed data
> instead of the varchar smiles.
>
> So, I could create another column, say smiles_ob and have the
> oe_matches function take that column.  I'd have to be sure the smiles_ob
> column was related (by a trigger?) to the smiles column.
> But I was thinking I might be able to hide these parsed objects from the
> user by somehow incoporating the parsed objects into a type of index.
> I'd like also to use additional columns (like molecular formula) in
> the match function to "triage" the table to quickly rule out impossible
> matches, before doing a full search/match.
>
> Am I way off the track here?  Is it a bad idea to clutter the index
> with things like this?  Is it possible?   Is there another
> approach that could hide some of these details from the user - meaning
> they would not have to create/update these additional columns?
>
> Thanks,
> TJ O'Donnell
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if
> your
>       joining column's datatypes do not match
>



Re: how to optimize my c-extension functions

From
TJ O'Donnell
Date:
Yes, my c function and it's sql counterpart, oe_matches(smiles)
uses two steps (1) parse smiles (2) search parsed smiles.
Parsing is expensive.  The smiles has an external string representation,
which is stored in a smiles column, but only the parsed form is actually
searchable.
The smiles representation is never searched in a traditional string
manner, except perhaps for a direct lookup (string equals).
LIKE has no meaning for smiles strings, similarly < or > are
meaningless.

Smiles is parsed into atom and bond representations using
3rd party c++ code/methods.  I simply call their methods
to parse and search.  A binary string can be got from them
for persistent storage in a postgres column.  It can then be
restored into a c++ object for searching, thus avoiding the
parsing stage, except for the initial creation of a row with
a smiles column.

My goal here is to optimize the search by storing the parsed smiles,
YET HIDE THIS FROM THE USER.  I thought I might be able to store
the parsed smiles in an index for me to use while searching, even
though it would not be used for indexing in the traditional manner.
This would mean creating a new indexing method.  I've read up on this
and it seems a daunting task.  Am I perverting the index method if
I try to do this?

So, aside from having the user be responsible for maintaining a
column of parsed smiles (and possibly other related columns which
would speed up the search), is there a way I can create and maintain
a table related to the table containing the smiles - and all
behind the scenes so the sql user is unaware of this.
My thought was that an index is like that and I might borrow some
of the internal uses of indexing for my purposes.

TJ O'Donnell
tjo@acm.org

Pierre-Frédéric Caillaud wrote:
>     I gather your program uses two steps, let's call them :
>     - parse( smiles ) -> data
>     - search( data ) -> result
>
>     You can create a functional index on your smiles column, but I don't
> know  if this will help you ; you can do things like CREATE INDEX ...
> ON  mytable( lower( myfield )), then SELECT ... FROM mytable WHERE
> lower(myfield) = something, BUT in your case I gather your search
> function  which processes the parsed data does a lot more than just dome
> character  match, so creating a functional index on parse(smile) would
> be useless for  selecting on search(parse(smile))...
>
>     So, in any case, if the parsing phase is slow, you can store the
> preparsed data in a text or binary field and search it directly, but
> this  will not be indexed.
>
>     If you can map a subset of your searchable properties to simple
> datatypes, you could do a first search for related matches, as you said.
>
>     You say nothing about how your system works internally, whta kind
> of  representation is it and what kind of searches do you actually do ?
>
>
> On Sat, 08 Jan 2005 15:50:06 -0800, TJ O'Donnell <tjo@acm.org> wrote:
>
>> I've written a c-language extension to postgresql to implement a
>> chemical search of a varchar column (named smiles, typically).
>> It might be called as:
>>   oe_matches(smiles,'COCC') where 'COCC' is a typical search string.
>> This uses 3rd party functions to parse the varchar inputs into c++
>> objects.  In order to optimize this search, I'd like to parse the
>> whole  table containing smiles just once, store it and use the parsed
>> data  instead of the varchar smiles.
>>
>> So, I could create another column, say smiles_ob and have the
>> oe_matches function take that column.  I'd have to be sure the
>> smiles_ob  column was related (by a trigger?) to the smiles column.
>> But I was thinking I might be able to hide these parsed objects from
>> the  user by somehow incoporating the parsed objects into a type of
>> index.
>> I'd like also to use additional columns (like molecular formula) in
>> the match function to "triage" the table to quickly rule out impossible
>> matches, before doing a full search/match.
>>
>> Am I way off the track here?  Is it a bad idea to clutter the index
>> with things like this?  Is it possible?   Is there another
>> approach that could hide some of these details from the user - meaning
>> they would not have to create/update these additional columns?
>>
>> Thanks,
>> TJ O'Donnell
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 9: the planner will ignore your desire to choose an index scan if
>> your
>>       joining column's datatypes do not match
>>
>

Re: how to optimize my c-extension functions

From
TJ O'Donnell
Date:
To add to my last followup posting, the only way I use
oe_mathces(smiles) is in something like the following:

Select smiles,id from structure where oe_matches(smiles,'CCOC');

The match string 'CCOC' in this case, varies widely according to
the needs of the user during that session.
It is analogous to regexp matching, except that the semantics
of oe_matches is nothing at all like string matching, even though
smiles is actually a character string.  Character string smiles
are simply an extenral representation of a more complex c++ molecular
structure object.

TJ


Pierre-Frédéric Caillaud wrote:
 >     You say nothing about how your system works internally, whta kind
 > of  representation is it and what kind of searches do you actually do ?
 >
 >


Re: how to optimize my c-extension functions

From
Pierre-Frédéric Caillaud
Date:
    Well, first and easy thing you can do is create a column to store the
parsed representation and update it via a trigger when the original,
unparsed column is updated or inserted.
    Is this sufficiently "hidden from the user" for you ? I know it's not
really hidden, but the fact that updating is automatic could be good
enough.
    You could also store this column in another table and join with the main
table.

    What are the kinds of searches you do more often ? Can you give a few
examples ?


> Yes, my c function and it's sql counterpart, oe_matches(smiles)
> uses two steps (1) parse smiles (2) search parsed smiles.
> Parsing is expensive.  The smiles has an external string representation,
> which is stored in a smiles column, but only the parsed form is actually
> searchable.
> The smiles representation is never searched in a traditional string
> manner, except perhaps for a direct lookup (string equals).
> LIKE has no meaning for smiles strings, similarly < or > are
> meaningless.
>
> Smiles is parsed into atom and bond representations using
> 3rd party c++ code/methods.  I simply call their methods
> to parse and search.  A binary string can be got from them
> for persistent storage in a postgres column.  It can then be
> restored into a c++ object for searching, thus avoiding the
> parsing stage, except for the initial creation of a row with
> a smiles column.
>
> My goal here is to optimize the search by storing the parsed smiles,
> YET HIDE THIS FROM THE USER.  I thought I might be able to store
> the parsed smiles in an index for me to use while searching, even
> though it would not be used for indexing in the traditional manner.
> This would mean creating a new indexing method.  I've read up on this
> and it seems a daunting task.  Am I perverting the index method if
> I try to do this?
>
> So, aside from having the user be responsible for maintaining a
> column of parsed smiles (and possibly other related columns which
> would speed up the search), is there a way I can create and maintain
> a table related to the table containing the smiles - and all
> behind the scenes so the sql user is unaware of this.
> My thought was that an index is like that and I might borrow some
> of the internal uses of indexing for my purposes.
>
> TJ O'Donnell
> tjo@acm.org
>
> Pierre-Frédéric Caillaud wrote:
>>     I gather your program uses two steps, let's call them :
>>     - parse( smiles ) -> data
>>     - search( data ) -> result
>>      You can create a functional index on your smiles column, but I
>> don't know  if this will help you ; you can do things like CREATE INDEX
>> ... ON  mytable( lower( myfield )), then SELECT ... FROM mytable WHERE
>> lower(myfield) = something, BUT in your case I gather your search
>> function  which processes the parsed data does a lot more than just
>> dome character  match, so creating a functional index on parse(smile)
>> would be useless for  selecting on search(parse(smile))...
>>      So, in any case, if the parsing phase is slow, you can store the
>> preparsed data in a text or binary field and search it directly, but
>> this  will not be indexed.
>>      If you can map a subset of your searchable properties to simple
>> datatypes, you could do a first search for related matches, as you said.
>>      You say nothing about how your system works internally, whta kind
>> of  representation is it and what kind of searches do you actually do ?
>>      On Sat, 08 Jan 2005 15:50:06 -0800, TJ O'Donnell <tjo@acm.org>
>> wrote:
>>
>>> I've written a c-language extension to postgresql to implement a
>>> chemical search of a varchar column (named smiles, typically).
>>> It might be called as:
>>>   oe_matches(smiles,'COCC') where 'COCC' is a typical search string.
>>> This uses 3rd party functions to parse the varchar inputs into c++
>>> objects.  In order to optimize this search, I'd like to parse the
>>> whole  table containing smiles just once, store it and use the parsed
>>> data  instead of the varchar smiles.
>>>
>>> So, I could create another column, say smiles_ob and have the
>>> oe_matches function take that column.  I'd have to be sure the
>>> smiles_ob  column was related (by a trigger?) to the smiles column.
>>> But I was thinking I might be able to hide these parsed objects from
>>> the  user by somehow incoporating the parsed objects into a type of
>>> index.
>>> I'd like also to use additional columns (like molecular formula) in
>>> the match function to "triage" the table to quickly rule out impossible
>>> matches, before doing a full search/match.
>>>
>>> Am I way off the track here?  Is it a bad idea to clutter the index
>>> with things like this?  Is it possible?   Is there another
>>> approach that could hide some of these details from the user - meaning
>>> they would not have to create/update these additional columns?
>>>
>>> Thanks,
>>> TJ O'Donnell
>>>
>>> ---------------------------(end of
>>> broadcast)---------------------------
>>> TIP 9: the planner will ignore your desire to choose an index scan if
>>> your
>>>       joining column's datatypes do not match
>>>
>>
>



Re: how to optimize my c-extension functions

From
TJ O'Donnell
Date:
Let me first say that I will routinely be dealing with
one million+ rows, so I want to take care to optimize
my work as much as possible, and to consider carefully
my design decisions.

The only type of search will be of the type:

Select smiles,id from structure where  oe_matches(smiles,'c1ccccc1C(=O)N');

or joins with other tables e.g.

Select id,smiles,clogp from structure,properties where
oe_matches(smiles,'c1ccccc1C(=O)N') and id = prop_id;
with id being a sequence (hence unique) and foreign
key prop_id column of properties.

There may be other useful functions of smiles, e.g.
int oe_count_matches(smiles,'CCC'),
and these would also prefer to use the pre-parsed smiles
c++ object.

After I parse the smiles,
the character string smiles is really of no use anymore.
It might be output, for example to an external program such as
smiles_to_jpeg which re-parses the smiles and need not be fast.
So, there is no real use for indexing smiles.  So I want to
borrow the internal tables normally used for indexing to store
my parsed smiles and use the parsed smiles in oe_matches and other
functions.
If I do this, maybe I would have to use operators (=,<,>,LIKE?)
to do the matching.  A c-function is simply called with data and
would have no access to indexes, correct?

TJ


Pierre-Frédéric Caillaud wrote:
>
>     Well, first and easy thing you can do is create a column to store
> the  parsed representation and update it via a trigger when the
> original,  unparsed column is updated or inserted.
>     Is this sufficiently "hidden from the user" for you ? I know it's
> not  really hidden, but the fact that updating is automatic could be
> good  enough.
>     You could also store this column in another table and join with the
> main  table.
>
>     What are the kinds of searches you do more often ? Can you give a
> few  examples ?
>
>
>> Yes, my c function and it's sql counterpart, oe_matches(smiles)
>> uses two steps (1) parse smiles (2) search parsed smiles.
>> Parsing is expensive.  The smiles has an external string representation,
>> which is stored in a smiles column, but only the parsed form is
>> actually  searchable.
>> The smiles representation is never searched in a traditional string
>> manner, except perhaps for a direct lookup (string equals).
>> LIKE has no meaning for smiles strings, similarly < or > are
>> meaningless.
>>
>> Smiles is parsed into atom and bond representations using
>> 3rd party c++ code/methods.  I simply call their methods
>> to parse and search.  A binary string can be got from them
>> for persistent storage in a postgres column.  It can then be
>> restored into a c++ object for searching, thus avoiding the
>> parsing stage, except for the initial creation of a row with
>> a smiles column.
>>
>> My goal here is to optimize the search by storing the parsed smiles,
>> YET HIDE THIS FROM THE USER.  I thought I might be able to store
>> the parsed smiles in an index for me to use while searching, even
>> though it would not be used for indexing in the traditional manner.
>> This would mean creating a new indexing method.  I've read up on this
>> and it seems a daunting task.  Am I perverting the index method if
>> I try to do this?
>>
>> So, aside from having the user be responsible for maintaining a
>> column of parsed smiles (and possibly other related columns which
>> would speed up the search), is there a way I can create and maintain
>> a table related to the table containing the smiles - and all
>> behind the scenes so the sql user is unaware of this.
>> My thought was that an index is like that and I might borrow some
>> of the internal uses of indexing for my purposes.
>>
>> TJ O'Donnell
>> tjo@acm.org
>>
>> Pierre-Frédéric Caillaud wrote:
>>
>>>     I gather your program uses two steps, let's call them :
>>>     - parse( smiles ) -> data
>>>     - search( data ) -> result
>>>      You can create a functional index on your smiles column, but I
>>> don't know  if this will help you ; you can do things like CREATE
>>> INDEX  ... ON  mytable( lower( myfield )), then SELECT ... FROM
>>> mytable WHERE   lower(myfield) = something, BUT in your case I gather
>>> your search  function  which processes the parsed data does a lot
>>> more than just  dome character  match, so creating a functional index
>>> on parse(smile)  would be useless for  selecting on
>>> search(parse(smile))...
>>>      So, in any case, if the parsing phase is slow, you can store
>>> the   preparsed data in a text or binary field and search it
>>> directly, but  this  will not be indexed.
>>>      If you can map a subset of your searchable properties to
>>> simple   datatypes, you could do a first search for related matches,
>>> as you said.
>>>      You say nothing about how your system works internally, whta
>>> kind  of  representation is it and what kind of searches do you
>>> actually do ?
>>>      On Sat, 08 Jan 2005 15:50:06 -0800, TJ O'Donnell <tjo@acm.org>
>>> wrote:
>>>
>>>> I've written a c-language extension to postgresql to implement a
>>>> chemical search of a varchar column (named smiles, typically).
>>>> It might be called as:
>>>>   oe_matches(smiles,'COCC') where 'COCC' is a typical search string.
>>>> This uses 3rd party functions to parse the varchar inputs into c++
>>>> objects.  In order to optimize this search, I'd like to parse the
>>>> whole  table containing smiles just once, store it and use the
>>>> parsed  data  instead of the varchar smiles.
>>>>
>>>> So, I could create another column, say smiles_ob and have the
>>>> oe_matches function take that column.  I'd have to be sure the
>>>> smiles_ob  column was related (by a trigger?) to the smiles column.
>>>> But I was thinking I might be able to hide these parsed objects
>>>> from  the  user by somehow incoporating the parsed objects into a
>>>> type of  index.
>>>> I'd like also to use additional columns (like molecular formula) in
>>>> the match function to "triage" the table to quickly rule out impossible
>>>> matches, before doing a full search/match.
>>>>
>>>> Am I way off the track here?  Is it a bad idea to clutter the index
>>>> with things like this?  Is it possible?   Is there another
>>>> approach that could hide some of these details from the user - meaning
>>>> they would not have to create/update these additional columns?
>>>>
>>>> Thanks,
>>>> TJ O'Donnell
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 9: the planner will ignore your desire to choose an index scan
>>>> if   your
>>>>       joining column's datatypes do not match
>>>>
>>>
>>
>

Re: how to optimize my c-extension functions

From
Tom Lane
Date:
"TJ O'Donnell" <tjo@acm.org> writes:
> The only type of search will be of the type:

> Select smiles,id from structure where  oe_matches(smiles,'c1ccccc1C(=O)N');

You haven't really said much about how you expect an index to be able to
help you with this, but I think if any index type can help you it will
be GiST.  What you would do is define an operator on top of the
oe_matches function, so that the above query is written say

Select smiles,id from structure where smiles ~~ 'c1ccccc1C(=O)N';

and then construct a GiST operator class that accepts ~~ as an
indexable operator.  There's not a huge amount of
plain-old-documentation about GiST but there are quite a few examples
available in the contrib/ tree.

I don't think you can completely hide the existence of the parsed
version of the smiles data.  The easiest way to go at it would be to
write the queries like

Select smiles,id from structure where smiles_parsed ~~ 'c1ccccc1C(=O)N';

where smiles_parsed is the extra column holding the parsed data, and
the ~~ operator is grabbed by a GiST index over that column.

Plan B would be to construct the index as a functional index and write

Select smiles,id from structure where parsed(smiles) ~~ 'c1ccccc1C(=O)N';

However plan B doesn't readily support applying any other operations to
the parsed data, since it doesn't exist anywhere except inside the
index.  Since you mentioned having other things you wanted to do with it,
I think you'll end up wanting the separate column.

            regards, tom lane

Re: how to optimize my c-extension functions

From
TJ O'Donnell
Date:
I was not hoping that indexing, per se, would help me.
In fact, indexing smiles would be of virtually no use
to me, except for exact matches, e.g. where smiles = 'CCCOC';
I was only trying to subvert the use of indexing for
my own purposes, to store the parsed smiles somewhere
automatic for the sql user, yet transparently available to my
functions for quick searches.

I think I've thought about this enough and gotten enough advice
to realize I should do this the straightforward way.
I should store the parsed smiles in a separate column,
have a trigger to keep it up to date, and require the
user to pass me the parsed_smiles column for quick searches.
And the user could maintain the parsed_smiles in a separate
table, if he so desired, with foreign key relations.

Thanks to everyone for all your advice.  This is my first
postgresql project and I'm liking what I've seen so far.

TJ

Tom Lane wrote:
> "TJ O'Donnell" <tjo@acm.org> writes:
>
>>The only type of search will be of the type:
>
>
>>Select smiles,id from structure where  oe_matches(smiles,'c1ccccc1C(=O)N');
>
>
> You haven't really said much about how you expect an index to be able to
> help you with this, but I think if any index type can help you it will
> be GiST.  What you would do is define an operator on top of the
> oe_matches function, so that the above query is written say
>
> Select smiles,id from structure where smiles ~~ 'c1ccccc1C(=O)N';
>
> and then construct a GiST operator class that accepts ~~ as an
> indexable operator.  There's not a huge amount of
> plain-old-documentation about GiST but there are quite a few examples
> available in the contrib/ tree.
>
> I don't think you can completely hide the existence of the parsed
> version of the smiles data.  The easiest way to go at it would be to
> write the queries like
>
> Select smiles,id from structure where smiles_parsed ~~ 'c1ccccc1C(=O)N';
>
> where smiles_parsed is the extra column holding the parsed data, and
> the ~~ operator is grabbed by a GiST index over that column.
>
> Plan B would be to construct the index as a functional index and write
>
> Select smiles,id from structure where parsed(smiles) ~~ 'c1ccccc1C(=O)N';
>
> However plan B doesn't readily support applying any other operations to
> the parsed data, since it doesn't exist anywhere except inside the
> index.  Since you mentioned having other things you wanted to do with it,
> I think you'll end up wanting the separate column.
>
>             regards, tom lane

Re: how to optimize my c-extension functions

From
Pierre-Frédéric Caillaud
Date:
    That's not what I meant...

    I meant, what does 'c1ccccc1C(=O)N' means ?
    If the search operation is too slow, you can narrow it using standard
postgres tools and then hand it down to your C functions. Let me explain,
I have no clue about this 'c1ccccc1C(=O)N' syntax, but I'll suppose you
will be searching for things like :

    1- molecule has N atoms of (whatever) element
    2- molecule has N single or double or triple covalent bonds
    3- molecule has such and such property

    Then, if you can understand the 'c1ccccc1C(=O)N' string and say that all
molecules that satisfy it will satisfy, for instance condition 2 above,
then you can have some fast searchable attributes in your database that
will mark all molecules satisfying condition 2, and you'll only need to
run the C search function on these to get the real matches.
    The idea is basically to narrow down the search to avoid calling the
expensive operator on all rows.

    If A and B and strings like your 'c1ccccc1C(=O)N', then if all molecules
satsfying B also satisfy A (thus B=>A or "B c A", B is contained in A in
set notation), if you can very quickly (with an index) grab the molecules
that satisfy A, and these are a significantly smaller number than the
whole set, then you'll speed your search a lot.
    If you can find some more A's, so that B c A1, B c A2, B c A3, then B c
(intersection of A1, A2, A3) which maps neatly to the gist index on an
integer array.
    So you could have a set of basic conditions, maybe a hundred or so, which
would be all tested on the search string to see which will apply to the
molecules this search string would find, then you translate this into a
GiST query.

    Are my explications making it clearer or just more obfuscated ?



> The only type of search will be of the type:
>
> Select smiles,id from structure where
> oe_matches(smiles,'c1ccccc1C(=O)N');
>
> or joins with other tables e.g.