Thread: Caching Websites
I'm writing a Python script that (among other things) caches websites. Ultimately, the data is all stored in a string (pickled, possibly zipped); lots of related stuff is stored in PostgreSQL tables. I am wondering whether it would be better to store each website as a record in a table, or instead have a table which links URLs to filenames (with the file containing the pickled website). The sites will of course vary greatly in size, but will typically be between 1k and 200k (I probably won't store anything bigger than that).

This seems like a simple question, and I suspect there's an obvious answer as to which data storage method makes more sense; I just don't know how to go about researching it. What would be the considerations for using one method of data storage vs. the other?

Any suggestions for me?

--Adam
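For reference, the kind of caching described above might look roughly like this minimal sketch in modern Python -- fetch a page, pickle it, and compress the pickle into a single byte string. The function names and the dict layout are invented for the example, not taken from the original script:

    import pickle
    import zlib
    from urllib.request import urlopen

    def cache_page(url):
        # Fetch the page and keep the body plus a little metadata.
        with urlopen(url) as resp:
            page = {
                "url": url,
                "headers": dict(resp.headers),
                "body": resp.read(),
            }
        # Pickle the structure, then compress it into one byte string.
        return zlib.compress(pickle.dumps(page))

    def restore_page(blob):
        # Reverse the steps: decompress, then unpickle.
        return pickle.loads(zlib.decompress(blob))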
On Friday 09 May 2003 9:48 pm, Adam Kessel wrote:
> I am wondering whether it would be better to store each website in a
> record in a table, or instead have a table which links URLs to filenames
> (the file would contain the pickled website). The sites will of course
> vary greatly in size, but typically be between 1k and 200k (I probably
> won't store anything bigger than that).
>
> This seems like a simple question, and I suspect there's an obvious
> answer for which data storage method makes more sense, I just don't know
> how to go about researching that. What would be the considerations for
> using one method of data storage vs. the other?
>
> Any suggestions for me?

Not that simple a question - look back through the archives for plenty of discussions (usually regarding images).

My personal approach is to ask myself whether I'm going to access/process the data in any way. Basically, if I want to do any of:
1. query the large data
2. summarise it
3. have transaction-based update control
then I'll store it in the database. If not, I'll store a path to the file.

-- Richard Huxton
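For concreteness, the path-to-file alternative Richard mentions might be glued together roughly like this. This is only a sketch: psycopg2, the cache directory, and the url_to_file table are assumptions for the illustration, not part of the original script:

    import hashlib
    import os
    import psycopg2

    CACHE_DIR = "/var/cache/sitecache"  # assumed location

    def store_to_file(conn, url, blob):
        # Derive a stable filename from the URL and write the pickled data there.
        name = hashlib.sha1(url.encode()).hexdigest() + ".pkl.z"
        path = os.path.join(CACHE_DIR, name)
        with open(path, "wb") as f:
            f.write(blob)
        # Record the URL -> filename mapping in PostgreSQL.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO url_to_file (url, filename) VALUES (%s, %s)",
                (url, path),
            )
        conn.commit()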
Someone else suggested using a 'large object', which I didn't know about:
http://www.postgresql.org/docs/view.php?version=7.3&idoc=1&file=largeobjects.html

It sounds like a large object is almost the same as storing files and paths to those files, but with stricter data integrity. I don't really plan on doing any database operations with the contents of these large objects--all manipulations will be done in Python once the data is retrieved. But it still seems cleaner not to have to maintain two parallel storage systems (database and filesystem) and make sure they don't get out of sync.

Based on the documentation, I don't immediately see any disadvantage to using these large objects--does anyone else see why I might not want to store archived websites in large objects?

--Adam Kessel

On Mon, May 12, 2003 at 09:39:19AM +0100, Richard Huxton wrote:
> On Friday 09 May 2003 9:48 pm, Adam Kessel wrote:
> > I am wondering whether it would be better to store each website in a
> > record in a table, or instead have a table which links URLs to filenames
> > (the file would contain the pickled website). The sites will of course
> > vary greatly in size, but typically be between 1k and 200k (I probably
> > won't store anything bigger than that).
> >
> > This seems like a simple question, and I suspect there's an obvious
> > answer for which data storage method makes more sense, I just don't know
> > how to go about researching that. What would be the considerations for
> > using one method of data storage vs. the other?
> >
> > Any suggestions for me?
>
> Not that simple a question - look back through the archives for plenty of
> discussions (usually regarding images).
>
> My personal approach is to ask myself whether I'm going to access/process the
> data in any way. Basically if I want to do any of:
> 1. query the large data
> 2. summarise it
> 3. have transaction-based update control
> then I'll store it in the database. If not, I'll store a path to the file.
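For what it's worth, the large-object route looks roughly like this with today's psycopg2 driver. A sketch under assumptions: the cached_site table and data_oid column are invented for the example, and the Python drivers of 2003 exposed large objects somewhat differently:

    import psycopg2

    def store_large_object(conn, url, blob):
        # Create a new large object and write the pickled/zipped site into it.
        lo = conn.lobject(0, "wb")   # oid=0 lets the server assign an OID
        lo.write(blob)
        oid = lo.oid
        lo.close()
        # Keep the URL -> OID mapping in an ordinary table.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO cached_site (url, data_oid) VALUES (%s, %s)",
                (url, oid),
            )
        conn.commit()
        return oid

    def load_large_object(conn, oid):
        lo = conn.lobject(oid, "rb")
        try:
            return lo.read()
        finally:
            lo.close()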
Adam Kessel <adam@bostoncoop.net> writes:
> Based on the documentation, I don't immediately see any disadvantage to
> using these large objects--does anyone else see why I might not want to
> store archived websites in large objects?

It's going to be (probably) a little slower than the filesystem solution, and backups are a little more involved (you can't use pg_dumpall), but everything works--I have been using LOs with success for a couple of years now.

-Doug
On 12 May 2003, Doug McNaught wrote:
> Adam Kessel <adam@bostoncoop.net> writes:
> > Based on the documentation, I don't immediately see any disadvantage to
> > using these large objects--does anyone else see why I might not want to
> > store archived websites in large objects?
>
> It's going to be (probably) a little slower than the filesystem
> solution, and backups are a little more involved (you can't use
> pg_dumpall), but everything works--I have been using LOs with success
> for a couple of years now.

If the files aren't too big (under a meg or so each), you can either use a bytea column, or base64-encode the data, escape it, and store it in a text field. Since PostgreSQL automatically compresses text fields, the fact that base64 is a little bigger is no big deal. The advantage to storing them in bytea or text with base64 is that pg_dump backs up your whole database.
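The bytea approach Scott describes is probably the simplest from Python. A minimal sketch with psycopg2 (the cached_site table is assumed; the base64-in-text variant would just wrap the blob in base64.b64encode first and store it in a text column):

    import psycopg2

    def store_bytea(conn, url, blob):
        # psycopg2.Binary wraps the bytes so they are passed correctly to a bytea column.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO cached_site (url, data) VALUES (%s, %s)",
                (url, psycopg2.Binary(blob)),
            )
        conn.commit()

    def load_bytea(conn, url):
        with conn.cursor() as cur:
            cur.execute("SELECT data FROM cached_site WHERE url = %s", (url,))
            row = cur.fetchone()
            # bytea comes back as a buffer/memoryview; convert to plain bytes.
            return bytes(row[0]) if row else None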
"scott.marlowe" <scott.marlowe@ihs.com> writes: > The advantage to storing them in bytea or text with base64 is that > pg_dump backs up your whole database. It does with LOs too; you just have to use the -o option and either the 'custom' or 'tar' format rather than straight SQL. -Doug
On 12 May 2003, Doug McNaught wrote:
> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > The advantage to storing them in bytea or text with base64 is that
> > pg_dump backs up your whole database.
>
> It does with LOs too; you just have to use the -o option and either
> the 'custom' or 'tar' format rather than straight SQL.

Cool. I could have sworn that you had to back them up separately. Was that the case at one time?
"scott.marlowe" <scott.marlowe@ihs.com> writes: > On 12 May 2003, Doug McNaught wrote: > > > It does with LOs too; you just have to use the -o option and either > > the 'custom' or 'tar' format rather than straight SQL. > > Cool. I could of sworn that you had to back them up seperately. Was that > the case at one time? Maybe--I don't remember. You do have to use an individual pg_dump command for each database--the pg_dumpall command won't take the -o option. -Doug
Maybe a little out of the loop... but if you're caching website stuff (html? xml?), then it might be best not to use the database. If your DB goes down, your content site goes down too.

I remember a project a little while back where we actually used plain ol' DBM files to cache the content. It was tens of times faster than the database, and would stay up no matter what.

I see what you're saying about the LOs, but IMHO the DB is not the best place for cached content.

- Ericson Smith
eric@did-it.com

On Mon, 2003-05-12 at 12:04, scott.marlowe wrote:
> On 12 May 2003, Doug McNaught wrote:
> > "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > > The advantage to storing them in bytea or text with base64 is that
> > > pg_dump backs up your whole database.
> >
> > It does with LOs too; you just have to use the -o option and either
> > the 'custom' or 'tar' format rather than straight SQL.
>
> Cool. I could have sworn that you had to back them up separately. Was that
> the case at one time?

--
Ericson Smith <eric@did-it.com>
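For concreteness, a DBM-file cache of the sort Ericson describes can be put together with Python's standard dbm module. A minimal sketch; the file path and the reuse of pickle/zlib are assumptions for the illustration:

    import dbm
    import pickle
    import zlib

    def store_page(path, url, page):
        # dbm keys and values are bytes; compress the pickled page before storing.
        with dbm.open(path, "c") as db:
            db[url.encode()] = zlib.compress(pickle.dumps(page))

    def load_page(path, url):
        with dbm.open(path, "r") as db:
            key = url.encode()
            if key not in db:
                return None
            return pickle.loads(zlib.decompress(db[key]))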
And if you're looking at performance caching, try Squid. It's complex to set up for most things, but as a simple single-site HTTP accelerator it's pretty fast, and not nearly as hard to set up as when it is using a redirector (i.e. a multi-backend / multi-frontend setup).

On 12 May 2003, Ericson Smith wrote:
> Maybe a little out of the loop... but if you're caching website stuff
> (html? xml?), then it might be best not to use the database. If your DB
> goes down, your content site goes down too.
>
> I remember a project a little while back where we actually used plain
> ol' DBM files to cache the content. It was tens of times faster than the
> database, and would stay up no matter what.
>
> I see what you're saying about the LOs, but IMHO the DB is not the
> best place for cached content.
>
> - Ericson Smith
> eric@did-it.com