Re: how to optimize my c-extension functions - Mailing list pgsql-general
From | TJ O'Donnell |
---|---|
Subject | Re: how to optimize my c-extension functions |
Date | |
Msg-id | 41E1D9B8.3030306@acm.org Whole thread Raw |
In response to | Re: how to optimize my c-extension functions (Pierre-Frédéric Caillaud<lists@boutiquenumerique.com>) |
Responses |
Re: how to optimize my c-extension functions
Re: how to optimize my c-extension functions |
List | pgsql-general |
Let me first say that I will routinely be dealing with one million+ rows, so I want to take care to optimize my work as much as possible, and to consider carefully my design decisions. The only type of search will be of the type: Select smiles,id from structure where oe_matches(smiles,'c1ccccc1C(=O)N'); or joins with other tables e.g. Select id,smiles,clogp from structure,properties where oe_matches(smiles,'c1ccccc1C(=O)N') and id = prop_id; with id being a sequence (hence unique) and foreign key prop_id column of properties. There may be other useful functions of smiles, e.g. int oe_count_matches(smiles,'CCC'), and these would also prefer to use the pre-parsed smiles c++ object. After I parse the smiles, the character string smiles is really of no use anymore. It might be output, for example to an external program such as smiles_to_jpeg which re-parses the smiles and need not be fast. So, there is no real use for indexing smiles. So I want to borrow the internal tables normally used for indexing to store my parsed smiles and use the parsed smiles in oe_matches and other functions. If I do this, maybe I would have to use operators (=,<,>,LIKE?) to do the matching. A c-function is simply called with data and would have no access to indexes, correct? TJ Pierre-Frédéric Caillaud wrote: > > Well, first and easy thing you can do is create a column to store > the parsed representation and update it via a trigger when the > original, unparsed column is updated or inserted. > Is this sufficiently "hidden from the user" for you ? I know it's > not really hidden, but the fact that updating is automatic could be > good enough. > You could also store this column in another table and join with the > main table. > > What are the kinds of searches you do more often ? Can you give a > few examples ? > > >> Yes, my c function and it's sql counterpart, oe_matches(smiles) >> uses two steps (1) parse smiles (2) search parsed smiles. >> Parsing is expensive. The smiles has an external string representation, >> which is stored in a smiles column, but only the parsed form is >> actually searchable. >> The smiles representation is never searched in a traditional string >> manner, except perhaps for a direct lookup (string equals). >> LIKE has no meaning for smiles strings, similarly < or > are >> meaningless. >> >> Smiles is parsed into atom and bond representations using >> 3rd party c++ code/methods. I simply call their methods >> to parse and search. A binary string can be got from them >> for persistent storage in a postgres column. It can then be >> restored into a c++ object for searching, thus avoiding the >> parsing stage, except for the initial creation of a row with >> a smiles column. >> >> My goal here is to optimize the search by storing the parsed smiles, >> YET HIDE THIS FROM THE USER. I thought I might be able to store >> the parsed smiles in an index for me to use while searching, even >> though it would not be used for indexing in the traditional manner. >> This would mean creating a new indexing method. I've read up on this >> and it seems a daunting task. Am I perverting the index method if >> I try to do this? >> >> So, aside from having the user be responsible for maintaining a >> column of parsed smiles (and possibly other related columns which >> would speed up the search), is there a way I can create and maintain >> a table related to the table containing the smiles - and all >> behind the scenes so the sql user is unaware of this. >> My thought was that an index is like that and I might borrow some >> of the internal uses of indexing for my purposes. >> >> TJ O'Donnell >> tjo@acm.org >> >> Pierre-Frédéric Caillaud wrote: >> >>> I gather your program uses two steps, let's call them : >>> - parse( smiles ) -> data >>> - search( data ) -> result >>> You can create a functional index on your smiles column, but I >>> don't know if this will help you ; you can do things like CREATE >>> INDEX ... ON mytable( lower( myfield )), then SELECT ... FROM >>> mytable WHERE lower(myfield) = something, BUT in your case I gather >>> your search function which processes the parsed data does a lot >>> more than just dome character match, so creating a functional index >>> on parse(smile) would be useless for selecting on >>> search(parse(smile))... >>> So, in any case, if the parsing phase is slow, you can store >>> the preparsed data in a text or binary field and search it >>> directly, but this will not be indexed. >>> If you can map a subset of your searchable properties to >>> simple datatypes, you could do a first search for related matches, >>> as you said. >>> You say nothing about how your system works internally, whta >>> kind of representation is it and what kind of searches do you >>> actually do ? >>> On Sat, 08 Jan 2005 15:50:06 -0800, TJ O'Donnell <tjo@acm.org> >>> wrote: >>> >>>> I've written a c-language extension to postgresql to implement a >>>> chemical search of a varchar column (named smiles, typically). >>>> It might be called as: >>>> oe_matches(smiles,'COCC') where 'COCC' is a typical search string. >>>> This uses 3rd party functions to parse the varchar inputs into c++ >>>> objects. In order to optimize this search, I'd like to parse the >>>> whole table containing smiles just once, store it and use the >>>> parsed data instead of the varchar smiles. >>>> >>>> So, I could create another column, say smiles_ob and have the >>>> oe_matches function take that column. I'd have to be sure the >>>> smiles_ob column was related (by a trigger?) to the smiles column. >>>> But I was thinking I might be able to hide these parsed objects >>>> from the user by somehow incoporating the parsed objects into a >>>> type of index. >>>> I'd like also to use additional columns (like molecular formula) in >>>> the match function to "triage" the table to quickly rule out impossible >>>> matches, before doing a full search/match. >>>> >>>> Am I way off the track here? Is it a bad idea to clutter the index >>>> with things like this? Is it possible? Is there another >>>> approach that could hide some of these details from the user - meaning >>>> they would not have to create/update these additional columns? >>>> >>>> Thanks, >>>> TJ O'Donnell >>>> >>>> ---------------------------(end of >>>> broadcast)--------------------------- >>>> TIP 9: the planner will ignore your desire to choose an index scan >>>> if your >>>> joining column's datatypes do not match >>>> >>> >> >
pgsql-general by date: