Re: how to optimize my c-extension functions - Mailing list pgsql-general

From TJ O'Donnell
Subject Re: how to optimize my c-extension functions
Date
Msg-id 41E1D9B8.3030306@acm.org
Whole thread Raw
In response to Re: how to optimize my c-extension functions  (Pierre-Frédéric Caillaud<lists@boutiquenumerique.com>)
Responses Re: how to optimize my c-extension functions  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: how to optimize my c-extension functions  (Pierre-Frédéric Caillaud<lists@boutiquenumerique.com>)
List pgsql-general
Let me first say that I will routinely be dealing with
one million+ rows, so I want to take care to optimize
my work as much as possible, and to consider carefully
my design decisions.

The only type of search will be of the type:

Select smiles,id from structure where  oe_matches(smiles,'c1ccccc1C(=O)N');

or joins with other tables e.g.

Select id,smiles,clogp from structure,properties where
oe_matches(smiles,'c1ccccc1C(=O)N') and id = prop_id;
with id being a sequence (hence unique) and foreign
key prop_id column of properties.

There may be other useful functions of smiles, e.g.
int oe_count_matches(smiles,'CCC'),
and these would also prefer to use the pre-parsed smiles
c++ object.

After I parse the smiles,
the character string smiles is really of no use anymore.
It might be output, for example to an external program such as
smiles_to_jpeg which re-parses the smiles and need not be fast.
So, there is no real use for indexing smiles.  So I want to
borrow the internal tables normally used for indexing to store
my parsed smiles and use the parsed smiles in oe_matches and other
functions.
If I do this, maybe I would have to use operators (=,<,>,LIKE?)
to do the matching.  A c-function is simply called with data and
would have no access to indexes, correct?

TJ


Pierre-Frédéric Caillaud wrote:
>
>     Well, first and easy thing you can do is create a column to store
> the  parsed representation and update it via a trigger when the
> original,  unparsed column is updated or inserted.
>     Is this sufficiently "hidden from the user" for you ? I know it's
> not  really hidden, but the fact that updating is automatic could be
> good  enough.
>     You could also store this column in another table and join with the
> main  table.
>
>     What are the kinds of searches you do more often ? Can you give a
> few  examples ?
>
>
>> Yes, my c function and it's sql counterpart, oe_matches(smiles)
>> uses two steps (1) parse smiles (2) search parsed smiles.
>> Parsing is expensive.  The smiles has an external string representation,
>> which is stored in a smiles column, but only the parsed form is
>> actually  searchable.
>> The smiles representation is never searched in a traditional string
>> manner, except perhaps for a direct lookup (string equals).
>> LIKE has no meaning for smiles strings, similarly < or > are
>> meaningless.
>>
>> Smiles is parsed into atom and bond representations using
>> 3rd party c++ code/methods.  I simply call their methods
>> to parse and search.  A binary string can be got from them
>> for persistent storage in a postgres column.  It can then be
>> restored into a c++ object for searching, thus avoiding the
>> parsing stage, except for the initial creation of a row with
>> a smiles column.
>>
>> My goal here is to optimize the search by storing the parsed smiles,
>> YET HIDE THIS FROM THE USER.  I thought I might be able to store
>> the parsed smiles in an index for me to use while searching, even
>> though it would not be used for indexing in the traditional manner.
>> This would mean creating a new indexing method.  I've read up on this
>> and it seems a daunting task.  Am I perverting the index method if
>> I try to do this?
>>
>> So, aside from having the user be responsible for maintaining a
>> column of parsed smiles (and possibly other related columns which
>> would speed up the search), is there a way I can create and maintain
>> a table related to the table containing the smiles - and all
>> behind the scenes so the sql user is unaware of this.
>> My thought was that an index is like that and I might borrow some
>> of the internal uses of indexing for my purposes.
>>
>> TJ O'Donnell
>> tjo@acm.org
>>
>> Pierre-Frédéric Caillaud wrote:
>>
>>>     I gather your program uses two steps, let's call them :
>>>     - parse( smiles ) -> data
>>>     - search( data ) -> result
>>>      You can create a functional index on your smiles column, but I
>>> don't know  if this will help you ; you can do things like CREATE
>>> INDEX  ... ON  mytable( lower( myfield )), then SELECT ... FROM
>>> mytable WHERE   lower(myfield) = something, BUT in your case I gather
>>> your search  function  which processes the parsed data does a lot
>>> more than just  dome character  match, so creating a functional index
>>> on parse(smile)  would be useless for  selecting on
>>> search(parse(smile))...
>>>      So, in any case, if the parsing phase is slow, you can store
>>> the   preparsed data in a text or binary field and search it
>>> directly, but  this  will not be indexed.
>>>      If you can map a subset of your searchable properties to
>>> simple   datatypes, you could do a first search for related matches,
>>> as you said.
>>>      You say nothing about how your system works internally, whta
>>> kind  of  representation is it and what kind of searches do you
>>> actually do ?
>>>      On Sat, 08 Jan 2005 15:50:06 -0800, TJ O'Donnell <tjo@acm.org>
>>> wrote:
>>>
>>>> I've written a c-language extension to postgresql to implement a
>>>> chemical search of a varchar column (named smiles, typically).
>>>> It might be called as:
>>>>   oe_matches(smiles,'COCC') where 'COCC' is a typical search string.
>>>> This uses 3rd party functions to parse the varchar inputs into c++
>>>> objects.  In order to optimize this search, I'd like to parse the
>>>> whole  table containing smiles just once, store it and use the
>>>> parsed  data  instead of the varchar smiles.
>>>>
>>>> So, I could create another column, say smiles_ob and have the
>>>> oe_matches function take that column.  I'd have to be sure the
>>>> smiles_ob  column was related (by a trigger?) to the smiles column.
>>>> But I was thinking I might be able to hide these parsed objects
>>>> from  the  user by somehow incoporating the parsed objects into a
>>>> type of  index.
>>>> I'd like also to use additional columns (like molecular formula) in
>>>> the match function to "triage" the table to quickly rule out impossible
>>>> matches, before doing a full search/match.
>>>>
>>>> Am I way off the track here?  Is it a bad idea to clutter the index
>>>> with things like this?  Is it possible?   Is there another
>>>> approach that could hide some of these details from the user - meaning
>>>> they would not have to create/update these additional columns?
>>>>
>>>> Thanks,
>>>> TJ O'Donnell
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 9: the planner will ignore your desire to choose an index scan
>>>> if   your
>>>>       joining column's datatypes do not match
>>>>
>>>
>>
>

pgsql-general by date:

Previous
From: Pierre-Frédéric Caillaud
Date:
Subject: Re: how to optimize my c-extension functions
Next
From: Tom Lane
Date:
Subject: Re: how to optimize my c-extension functions