Re: how to optimize my c-extension functions - Mailing list pgsql-general

From Pierre-Frédéric Caillaud
Subject Re: how to optimize my c-extension functions
Date
Msg-id opskcx5lp1cq72hf@musicbox
Whole thread Raw
In response to Re: how to optimize my c-extension functions  (TJ O'Donnell <tjo@acm.org>)
Responses Re: how to optimize my c-extension functions  (TJ O'Donnell <tjo@acm.org>)
List pgsql-general
    Well, first and easy thing you can do is create a column to store the
parsed representation and update it via a trigger when the original,
unparsed column is updated or inserted.
    Is this sufficiently "hidden from the user" for you ? I know it's not
really hidden, but the fact that updating is automatic could be good
enough.
    You could also store this column in another table and join with the main
table.

    What are the kinds of searches you do more often ? Can you give a few
examples ?


> Yes, my c function and it's sql counterpart, oe_matches(smiles)
> uses two steps (1) parse smiles (2) search parsed smiles.
> Parsing is expensive.  The smiles has an external string representation,
> which is stored in a smiles column, but only the parsed form is actually
> searchable.
> The smiles representation is never searched in a traditional string
> manner, except perhaps for a direct lookup (string equals).
> LIKE has no meaning for smiles strings, similarly < or > are
> meaningless.
>
> Smiles is parsed into atom and bond representations using
> 3rd party c++ code/methods.  I simply call their methods
> to parse and search.  A binary string can be got from them
> for persistent storage in a postgres column.  It can then be
> restored into a c++ object for searching, thus avoiding the
> parsing stage, except for the initial creation of a row with
> a smiles column.
>
> My goal here is to optimize the search by storing the parsed smiles,
> YET HIDE THIS FROM THE USER.  I thought I might be able to store
> the parsed smiles in an index for me to use while searching, even
> though it would not be used for indexing in the traditional manner.
> This would mean creating a new indexing method.  I've read up on this
> and it seems a daunting task.  Am I perverting the index method if
> I try to do this?
>
> So, aside from having the user be responsible for maintaining a
> column of parsed smiles (and possibly other related columns which
> would speed up the search), is there a way I can create and maintain
> a table related to the table containing the smiles - and all
> behind the scenes so the sql user is unaware of this.
> My thought was that an index is like that and I might borrow some
> of the internal uses of indexing for my purposes.
>
> TJ O'Donnell
> tjo@acm.org
>
> Pierre-Frédéric Caillaud wrote:
>>     I gather your program uses two steps, let's call them :
>>     - parse( smiles ) -> data
>>     - search( data ) -> result
>>      You can create a functional index on your smiles column, but I
>> don't know  if this will help you ; you can do things like CREATE INDEX
>> ... ON  mytable( lower( myfield )), then SELECT ... FROM mytable WHERE
>> lower(myfield) = something, BUT in your case I gather your search
>> function  which processes the parsed data does a lot more than just
>> dome character  match, so creating a functional index on parse(smile)
>> would be useless for  selecting on search(parse(smile))...
>>      So, in any case, if the parsing phase is slow, you can store the
>> preparsed data in a text or binary field and search it directly, but
>> this  will not be indexed.
>>      If you can map a subset of your searchable properties to simple
>> datatypes, you could do a first search for related matches, as you said.
>>      You say nothing about how your system works internally, whta kind
>> of  representation is it and what kind of searches do you actually do ?
>>      On Sat, 08 Jan 2005 15:50:06 -0800, TJ O'Donnell <tjo@acm.org>
>> wrote:
>>
>>> I've written a c-language extension to postgresql to implement a
>>> chemical search of a varchar column (named smiles, typically).
>>> It might be called as:
>>>   oe_matches(smiles,'COCC') where 'COCC' is a typical search string.
>>> This uses 3rd party functions to parse the varchar inputs into c++
>>> objects.  In order to optimize this search, I'd like to parse the
>>> whole  table containing smiles just once, store it and use the parsed
>>> data  instead of the varchar smiles.
>>>
>>> So, I could create another column, say smiles_ob and have the
>>> oe_matches function take that column.  I'd have to be sure the
>>> smiles_ob  column was related (by a trigger?) to the smiles column.
>>> But I was thinking I might be able to hide these parsed objects from
>>> the  user by somehow incoporating the parsed objects into a type of
>>> index.
>>> I'd like also to use additional columns (like molecular formula) in
>>> the match function to "triage" the table to quickly rule out impossible
>>> matches, before doing a full search/match.
>>>
>>> Am I way off the track here?  Is it a bad idea to clutter the index
>>> with things like this?  Is it possible?   Is there another
>>> approach that could hide some of these details from the user - meaning
>>> they would not have to create/update these additional columns?
>>>
>>> Thanks,
>>> TJ O'Donnell
>>>
>>> ---------------------------(end of
>>> broadcast)---------------------------
>>> TIP 9: the planner will ignore your desire to choose an index scan if
>>> your
>>>       joining column's datatypes do not match
>>>
>>
>



pgsql-general by date:

Previous
From: "Constanze Tschritter"
Date:
Subject: Re: configuration problems of postgresql under windows xp
Next
From: TJ O'Donnell
Date:
Subject: Re: how to optimize my c-extension functions