Thread: to pickle or not to pickle
I'm writing a search engine using python and postgresql which requires storing a temporary list of results in an sql table for each request. This list will contain at least 50 records and could grow to about 300. My options are either to pickle the list and store a single entry, or to use the postgresql COPY command (as opposed to INSERT, which would be too slow) to store each of the temporary records.

Question is, how can I make an educated decision on which option to select? What kind of questions should I be asking myself? Should I actually go through the trouble of implementing both alternatives and profiling each separately? If so, how can I predict what will happen under a heavy load, which is hard to simulate when benchmarking each option?

Thanks
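For concreteness, a rough sketch of the two options being weighed here. It assumes the psycopg2 driver, a (url, score) record shape, and made-up table names results_pickle and results_copy; none of these come from the thread.

# Sketch only: assumes a reachable PostgreSQL database and the psycopg2 driver.
# Table names, column names and the record shape are illustrative assumptions.
import io
import pickle
import time

import psycopg2

def store_pickled(conn, request_id, records):
    """Option 1: pickle the whole result list into a single row."""
    blob = pickle.dumps(records)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO results_pickle (request_id, data) VALUES (%s, %s)",
            (request_id, psycopg2.Binary(blob)),
        )
    conn.commit()

def store_with_copy(conn, request_id, records):
    """Option 2: stream one row per record with COPY (cheaper than many INSERTs)."""
    buf = io.StringIO()
    for rank, (url, score) in enumerate(records):
        buf.write(f"{request_id}\t{rank}\t{url}\t{score}\n")
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_from(buf, "results_copy",
                      columns=("request_id", "rank", "url", "score"))
    conn.commit()

def timed(fn, *args):
    """Crude single-request timing; a real benchmark would repeat and average."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

Timing each function over many simulated requests would at least give a baseline before worrying about behaviour under heavy load.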
Marc Tardif wrote:
> I'm writing a search engine using python and postgresql which requires to
> store a temporary list of results in an sql table for each request. This
> list will contain at least 50 records and could grow to about 300. My
> options are either to pickle the list and store a single entry or use the
> postgresql COPY command (as opposed to INSERT which would be too slow) to
> store each of the temporary records.

You are writing a search engine: does that mean that you need to search the
web and that you want to store your temporary results in a table, OR does
that mean that you are writing a QUERY screen, from which you generate a
SELECT statement to query your POSTGRES database?

Also, what size are your tuples?

Do you need these temporary results within the same program, or do you need
to pass them somewhere to another program?

> Question is, how can I make an educated decision on which option to select?
> What kind of questions should I be asking myself? Should I actually go
> through the trouble of implementing both alternatives and profiling each
> separately? If so, how can I predict what will happen under a heavy load
> which is hard to simulate when benchmarking each option?

Always go for a simple solution. This may (paradoxically) need some more
study. One of the first questions you should ask yourself is: is it really
necessary to store this temporary result? If so, then why take the pickle
option? Pickling is meant for persistent data; it is really a mechanism to
store data between sessions. Maybe you should consider the option which is
used in traditional IT: just store your data in a sequential file. Much less
overhead, because your OS handles it directly.

Concerning the benchmarking, it seems as if the only way to do this is to
automatically start scripts which do what needs to be done and then measure
what happens: number of processes, CPU and IO load.

Jurgen Defurne
defurnj@glo.be
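A small sketch of the sequential-file option mentioned above, using Python's standard csv and tempfile modules; the (url, score) record shape is an assumption made only for illustration.

# Write the temporary results to a plain file handled by the OS, read them back.
import csv
import tempfile

def write_results(records):
    """Dump records to a temporary CSV file and return its path."""
    tmp = tempfile.NamedTemporaryFile(
        mode="w", newline="", suffix=".csv", delete=False)
    with tmp:
        writer = csv.writer(tmp)
        writer.writerows(records)
    return tmp.name

def read_results(path):
    """Read the records back; scores come back as strings and must be cast."""
    with open(path, newline="") as f:
        return [(url, float(score)) for url, score in csv.reader(f)]

results = [("http://example.com/a", 0.91), ("http://example.com/b", 0.72)]
path = write_results(results)
print(read_results(path))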
Marc Tardif wrote:
> > > I'm writing a search engine using python and postgresql which requires to
> > > store a temporary list of results in an sql table for each request. This
> > > list will contain at least 50 records and could grow to about 300. My
> > > options are either to pickle the list and store a single entry or use the
> > > postgresql COPY command (as opposed to INSERT which would be too slow) to
> > > store each of the temporary records.
> >
> > You are writing a search engine: does that mean that you need to search the
> > web and that you want to store your temporary results in a table, OR does
> > that mean that you are writing a QUERY screen, from which you generate a
> > SELECT statement to query your POSTGRES database?
> >
> > Also, what size are your tuples?
> >
> > Do you need these temporary results within the same program, or do you
> > need to pass them somewhere to another program?
>
> The former, search the web and store temporary results in a table. As for
> the tuples, I can expect each to be <100 bytes. Finally, the temporary
> results will only be used by the same program.

If your temporary results ARE really to be used by the same program, then I
suggest that you use a solution whereby you keep your temp results in a data
structure in memory, and not write them to any table or temporary file.
Python has enough basic and extended data structures to do that. If your
tuple size is 100 bytes and you are sure that you have a maximum of 300
tuples, then you will spend approximately 30 KB of memory (not counting
run-time overhead). Using a simple list to store your data will simplify
your life much, and you don't need to worry about memory management.

Good luck.

Jurgen Defurne
defurnj@glo.be
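A minimal sketch of the in-memory approach suggested above; the (url, score) record shape and the page size of 20 are assumptions, not something from the thread.

# Keep the ~300 temporary results in an ordinary Python list for the lifetime
# of the request; no table or temporary file needed.

def page(results, offset, size=20):
    """Return one 'page' of results from the in-memory list."""
    return results[offset:offset + size]

results = [(f"http://example.com/{i}", 1.0 / (i + 1)) for i in range(300)]
results.sort(key=lambda rec: rec[1], reverse=True)  # best score first

print(len(results), "results held in memory")
print(page(results, 0))   # first 20
print(page(results, 20))  # next 20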
At 11:56 AM 31-05-2000 -0400, Marc Tardif wrote:
>I'm writing a search engine using python and postgresql which requires to
>store a temporary list of results in an sql table for each request. This
>list will contain at least 50 records and could grow to about 300. My
>options are either to pickle the list and store a single entry or use the
>postgresql COPY command (as opposed to INSERT which would be too slow) to
>store each of the temporary records.

Are you trying to do:

"showing 20 results", click next/previous for the next/previous 20?

Whatever it is, I don't think you should use COPY. The way I did it was to
just do the query again and only display the relevant results, using offset
and window values. Not as efficient, but:

1) I wanted to know how many rows there were, so if I used SELECT ... LIMIT,
I'd have to do a SELECT count first, and AFAIK Postgresql has no special
optimizations for SELECT count (not even sure if other databases would be
faster for _my_ SELECT count).

2) I didn't want to deal with cleaning up the cache/pickles. My app was web
based, so I don't know when the users have left. Say I expire the
cache/pickles after 15 minutes: if I have 100 searches per minute, I'd end up
having 1500 pickles at a time 8*). Not really a big problem nowadays, but I
didn't think it was worth dealing with.

3) It wasn't really a search engine: different results for different users,
different ways of sorting stuff, etc.

But if your search engine returns the same result given the same query no
matter who the user is, the cache thing could be good. It may mean a
redesign: have a cache table storing queries and results (and expiry). You
will probably require regular vacuuming, since the cache table will be
changing quite often.

e.g. each row: query string, result1, result2, sequence, total results,
expiry time. By storing the total results you can use Postgresql's LIMIT
feature more intelligently. You can probably afford to waste the 4 bytes per
row, and keep everything in one table for speed.

Cheerio,
Link.
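A rough sketch of the re-query-with-LIMIT/OFFSET approach and of a cache table along the lines of the row layout described above; the psycopg2 driver, the documents table, the ILIKE-based query and all column names are illustrative assumptions.

import psycopg2

def fetch_page(conn, terms, offset, size=20):
    """Re-run the search for every page request, returning one window of rows."""
    with conn.cursor() as cur:
        cur.execute(
            """SELECT url, score FROM documents
               WHERE body ILIKE %s
               ORDER BY score DESC
               LIMIT %s OFFSET %s""",
            (f"%{terms}%", size, offset),
        )
        return cur.fetchall()

# Possible cache table, roughly mirroring the suggested row layout
# (query string, result1, result2, sequence, total results, expiry time):
CACHE_DDL = """
CREATE TABLE IF NOT EXISTS search_cache (
    query_string  text,
    result1       text,
    result2       text,
    sequence      integer,
    total_results integer,
    expires_at    timestamptz
)
"""

Storing total_results on every row wastes a few bytes per row but lets the application decide how many pages exist without a separate SELECT count, which is the trade-off described above.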
At 16:28 05/06/00 +0800, you wrote:
>At 11:56 AM 31-05-2000 -0400, Marc Tardif wrote:
> >I'm writing a search engine using python and postgresql which requires to
> >store a temporary list of results in an sql table for each request. This
> >list will contain at least 50 records and could grow to about 300. My
> >options are either to pickle the list and store a single entry or use the
> >postgresql COPY command (as opposed to INSERT which would be too slow) to
> >store each of the temporary records.
>
>Are you trying to do:
>
>"showing 20 results" click next/previous for next/previous 20.

If you _are_ trying to do this, and if it's a web-based development, you
might like to look at Zope (www.zope.org). It works well with PostgreSQL and
will do this for you automatically (take a look at the SQL Methods docs on
that site). It's written in and uses python, so you should feel really at
home.

Let me know if you need any more help on Zope.

Richard

Richard Moon
richard@dcs.co.uk