Thread: Caching Websites
I'm writing a Python script that (among other things) caches websites. Ultimately, the data is all stored in a string (pickled, possibly zipped); lots of related stuff is stored in PostgreSQL tables. I am wondering whether it would be better to store each website as a record in a table, or instead have a table which links URLs to filenames (with the file containing the pickled website). The sites will of course vary greatly in size, but will typically be between 1k and 200k (I probably won't store anything bigger than that).

This seems like a simple question, and I suspect there's an obvious answer as to which data storage method makes more sense; I just don't know how to go about researching it. What would be the considerations for using one method of data storage vs. the other?

Any suggestions for me?

--Adam
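For reference, the kind of caching described above might look roughly like this minimal sketch in modern Python -- fetch a page, pickle it, and compress the pickle into a single byte string. The function names and the dict layout are invented for the example, not taken from the original script:

    import pickle
    import zlib
    from urllib.request import urlopen

    def cache_page(url):
        # Fetch the page and keep the body plus a little metadata.
        with urlopen(url) as resp:
            page = {
                "url": url,
                "headers": dict(resp.headers),
                "body": resp.read(),
            }
        # Pickle the structure, then compress it into one byte string.
        return zlib.compress(pickle.dumps(page))

    def restore_page(blob):
        # Reverse the steps: decompress, then unpickle.
        return pickle.loads(zlib.decompress(blob))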
On Friday 09 May 2003 9:48 pm, Adam Kessel wrote:
> I am wondering whether it would be better to store each website in a
> record in a table, or instead have a table which links URLs to filenames
> (the file would contain the pickled website). The sites will of course
> vary greatly in size, but typically be between 1k and 200k (I probably
> won't store anything bigger than that).
>
> This seems like a simple question, and I suspect there's an obvious
> answer for which data storage method makes more sense, I just don't know
> how to go about researching that. What would be the considerations for
> using one method of data storage vs. the other?
>
> Any suggestions for me?

Not that simple a question - look back through the archives for plenty of discussions (usually regarding images).

My personal approach is to ask myself whether I'm going to access/process the data in any way. Basically, if I want to do any of:
1. query the large data
2. summarise it
3. have transaction-based update control
then I'll store it in the database. If not, I'll store a path to the file.

-- Richard Huxton
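For concreteness, the path-to-file alternative Richard mentions might be glued together roughly like this. This is only a sketch: psycopg2, the cache directory, and the url_to_file table are assumptions for the illustration, not part of the original script:

    import hashlib
    import os
    import psycopg2

    CACHE_DIR = "/var/cache/sitecache"  # assumed location

    def store_to_file(conn, url, blob):
        # Derive a stable filename from the URL and write the pickled data there.
        name = hashlib.sha1(url.encode()).hexdigest() + ".pkl.z"
        path = os.path.join(CACHE_DIR, name)
        with open(path, "wb") as f:
            f.write(blob)
        # Record the URL -> filename mapping in PostgreSQL.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO url_to_file (url, filename) VALUES (%s, %s)",
                (url, path),
            )
        conn.commit()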
Someone else suggested using a 'large object', which I didn't know about:
http://www.postgresql.org/docs/view.php?version=7.3&idoc=1&file=largeobjects.html

It sounds like a large object is almost the same as storing files and paths to those files, but with stricter data integrity. I don't really plan on doing any database operations with the contents of these large objects--all manipulations will be done in Python once the data is retrieved. But it still seems cleaner not to have to maintain two parallel storage systems (database and filesystem) and make sure they don't get out of sync.

Based on the documentation, I don't immediately see any disadvantage to using these large objects--does anyone else see why I might not want to store archived websites in large objects?

--Adam Kessel

On Mon, May 12, 2003 at 09:39:19AM +0100, Richard Huxton wrote:
> On Friday 09 May 2003 9:48 pm, Adam Kessel wrote:
> > I am wondering whether it would be better to store each website in a
> > record in a table, or instead have a table which links URLs to filenames
> > (the file would contain the pickled website). The sites will of course
> > vary greatly in size, but typically be between 1k and 200k (I probably
> > won't store anything bigger than that).
> >
> > This seems like a simple question, and I suspect there's an obvious
> > answer for which data storage method makes more sense, I just don't know
> > how to go about researching that. What would be the considerations for
> > using one method of data storage vs. the other?
> >
> > Any suggestions for me?
>
> Not that simple a question - look back through the archives for plenty of
> discussions (usually regarding images).
>
> My personal approach is to ask myself whether I'm going to access/process the
> data in any way. Basically if I want to do any of:
> 1. query the large data
> 2. summarise it
> 3. have transaction-based update control
> then I'll store it in the database. If not, I'll store a path to the file.
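For what it's worth, the large-object route looks roughly like this with today's psycopg2 driver. A sketch under assumptions: the cached_site table and data_oid column are invented for the example, and the Python drivers of 2003 exposed large objects somewhat differently:

    import psycopg2

    def store_large_object(conn, url, blob):
        # Create a new large object and write the pickled/zipped site into it.
        lo = conn.lobject(0, "wb")   # oid=0 lets the server assign an OID
        lo.write(blob)
        oid = lo.oid
        lo.close()
        # Keep the URL -> OID mapping in an ordinary table.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO cached_site (url, data_oid) VALUES (%s, %s)",
                (url, oid),
            )
        conn.commit()
        return oid

    def load_large_object(conn, oid):
        lo = conn.lobject(oid, "rb")
        try:
            return lo.read()
        finally:
            lo.close()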
Adam Kessel <adam@bostoncoop.net> writes:
> Based on the documentation, I don't immediately see any disadvantage to
> using these large objects--does anyone else see why I might not want to
> store archived websites in large objects?

It's going to be (probably) a little slower than the filesystem solution, and backups are a little more involved (you can't use pg_dumpall), but everything works--I have been using LOs with success for a couple of years now.

-Doug
On 12 May 2003, Doug McNaught wrote:
> Adam Kessel <adam@bostoncoop.net> writes:
> > Based on the documentation, I don't immediately see any disadvantage to
> > using these large objects--does anyone else see why I might not want to
> > store archived websites in large objects?
>
> It's going to be (probably) a little slower than the filesystem
> solution, and backups are a little more involved (you can't use
> pg_dumpall), but everything works--I have been using LOs with success
> for a couple of years now.

If the files aren't too big (under a meg or so each), you can either use a bytea column, or base64-encode the data, escape it, and store it in a text field. Since PostgreSQL automatically compresses text fields, the fact that base64 is a little bigger is no big deal. The advantage to storing them in bytea or text with base64 is that pg_dump backs up your whole database.
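The bytea approach Scott describes is probably the simplest from Python. A minimal sketch with psycopg2 (the cached_site table is assumed; the base64-in-text variant would just wrap the blob in base64.b64encode first and store it in a text column):

    import psycopg2

    def store_bytea(conn, url, blob):
        # psycopg2.Binary wraps the bytes so they are passed correctly to a bytea column.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO cached_site (url, data) VALUES (%s, %s)",
                (url, psycopg2.Binary(blob)),
            )
        conn.commit()

    def load_bytea(conn, url):
        with conn.cursor() as cur:
            cur.execute("SELECT data FROM cached_site WHERE url = %s", (url,))
            row = cur.fetchone()
            # bytea comes back as a buffer/memoryview; convert to plain bytes.
            return bytes(row[0]) if row else None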
"scott.marlowe" <scott.marlowe@ihs.com> writes: > The advantage to storing them in bytea or text with base64 is that > pg_dump backs up your whole database. It does with LOs too; you just have to use the -o option and either the 'custom' or 'tar' format rather than straight SQL. -Doug
On 12 May 2003, Doug McNaught wrote:
> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > The advantage to storing them in bytea or text with base64 is that
> > pg_dump backs up your whole database.
>
> It does with LOs too; you just have to use the -o option and either
> the 'custom' or 'tar' format rather than straight SQL.

Cool. I could have sworn that you had to back them up separately. Was that the case at one time?
"scott.marlowe" <scott.marlowe@ihs.com> writes: > On 12 May 2003, Doug McNaught wrote: > > > It does with LOs too; you just have to use the -o option and either > > the 'custom' or 'tar' format rather than straight SQL. > > Cool. I could of sworn that you had to back them up seperately. Was that > the case at one time? Maybe--I don't remember. You do have to use an individual pg_dump command for each database--the pg_dumpall command won't take the -o option. -Doug
Maybe a little out of the loop... but if you're caching website stuff (html? xml?), then it might be best not to use the database. If your DB goes down, your content site goes down too.

I remember a project a little while back where we actually used plain ol' DBM files to cache the content. It was tens of times faster than the database, and would stay up no matter what.

I see what you're saying about the LOs, but IMHO the DB is not the best place for cached content.

- Ericson Smith
eric@did-it.com

On Mon, 2003-05-12 at 12:04, scott.marlowe wrote:
> On 12 May 2003, Doug McNaught wrote:
> > "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > > The advantage to storing them in bytea or text with base64 is that
> > > pg_dump backs up your whole database.
> >
> > It does with LOs too; you just have to use the -o option and either
> > the 'custom' or 'tar' format rather than straight SQL.
>
> Cool. I could have sworn that you had to back them up separately. Was that
> the case at one time?

--
Ericson Smith <eric@did-it.com>
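For concreteness, a DBM-file cache of the sort Ericson describes can be put together with Python's standard dbm module. A minimal sketch; the file path and the reuse of pickle/zlib are assumptions for the illustration:

    import dbm
    import pickle
    import zlib

    def store_page(path, url, page):
        # dbm keys and values are bytes; compress the pickled page before storing.
        with dbm.open(path, "c") as db:
            db[url.encode()] = zlib.compress(pickle.dumps(page))

    def load_page(path, url):
        with dbm.open(path, "r") as db:
            key = url.encode()
            if key not in db:
                return None
            return pickle.loads(zlib.decompress(db[key]))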
And if you're looking at performance caching, try Squid. It's complex to set up for most things, but as a simple single-site HTTP accelerator it's pretty fast, and not nearly as hard to set up as when it is using a redirector (i.e. a multi-backend / multi-frontend setup).

On 12 May 2003, Ericson Smith wrote:
> Maybe a little out of the loop... but if you're caching website stuff
> (html? xml?), then it might be best not to use the database. If your DB
> goes down, your content site goes down too.
>
> I remember a project a little while back where we actually used plain
> ol' DBM files to cache the content. It was tens of times faster than the
> database, and would stay up no matter what.
>
> I see what you're saying about the LOs, but IMHO the DB is not the
> best place for cached content.
>
> - Ericson Smith
> eric@did-it.com