Thread: postgresql.org Weblogs?
Dave, Marc, Magnus etc.

I, GreenPlum, JasperSoft, and others are setting up a demo of data warehousing on Bizgres/PostgreSQL for OSCON. The demo will involve doing sophisticated reporting on clickstream (weblog) data.

We've asked a couple of high-profile web sites for their weblog data for this demo, but due to corporate bureaucracy, they may not come through in time. Would it be possible for us, you think, to use the weblogs of some of the PostgreSQL.org sites? The end product will be OSS and will run on PostgreSQL.

What we're looking for is 2+ weeks of web logs in extended format. We want the PostgreSQL.org data as a backup in case neither of these big-name web sites comes through. Does this sound possible?

Thanks!

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Sure ... there is nothing confidential in them, just let us know what you want (if you want it) ...

On Mon, 11 Jul 2005, Josh Berkus wrote:

> Would it be possible for us, you think, to use the weblogs of some
> of the PostgreSQL.org sites? The end product will be OSS and will run on
> PostgreSQL.
>
> What we're looking for is 2+weeks of web logs in extended format.

----
Marc G. Fournier
Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
Marc,

> Sure ... there is nothing confidential in them, just let us know what
> you want (if you want it) ...

Yeah, I thought I'd ask now, because presumably you don't keep the weblogs indefinitely. We'd like two or more weeks' worth in one of the following formats:

==============================
Extended/Combined log format is OK. Cookies are not included, but for a demo it should work fine. The inclusion of the user agent reduces merges, but only to a certain extent.

Fields
------
remotehost
rfc931
authuser
[date]
"request"
status
bytes
"referer"
"user_agent"

NCSA Combined Log Format, W3CA format or the combined plus cookies are the most useful.

So you end up with

Fields
------
remotehost
rfc931
authuser
[date]
"request"
status
bytes
"referer"
"user_agent"
In-cookie
Out-cookie
Server (optional)
==================================

--Josh

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
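For reference, the nine-field combined layout listed in the message above can be split apart with a single regular expression. The following is a minimal Python sketch and was not posted in the thread: the pattern name `COMBINED_RE`, the helper `parse_combined`, and the sample line are all made up for illustration, and the pattern assumes no escaped double quotes inside the quoted fields.

```python
import re
from typing import Optional

# Hypothetical pattern: one named group per field in the combined layout
# (remotehost rfc931 authuser [date] "request" status bytes "referer" "user_agent").
COMBINED_RE = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_combined(line: str) -> Optional[dict]:
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None


# Invented sample line (documentation-range IP address, example.com referer).
sample = (
    '192.0.2.10 - - [11/Jul/2005:13:55:36 -0700] '
    '"GET /docs/index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

record = parse_combined(sample)
if record is not None:
    print(record["remotehost"], record["status"], record["request"])
```

All values come back as strings; converting `status` and `bytes` to integers (and treating a `-` byte count as zero) would be a natural next step before loading the rows into a table.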
On Mon, 11 Jul 2005, Josh Berkus wrote:

> Marc,
>
>> Sure ... there is nothing confidential in them, just let us know what
>> you want (if you want it) ...
>
> Yeah, I thought I'd ask now, because presumably you don't keep the weblogs
> indefinitely.

Actually ... unless a server crashes so that we lose them, we do ... svr1.postgresql (developer) has stuff going back to May 2004 ... we used to have www.postgresql.org *way* back before all the moving around ... pgfoundry.org goes back to May 7th, 2004 ... I'm a pack rat :)

> Extended/Combined log format is ok but cookies are not included but for
> a demo it should work ok. The inclusion of the user agent reduces merges
> but only to a certain extent.
>
> Fields
> ------
> remotehost
> rfc931
> authuser
> [date]
> "request"
> status
> bytes
> "referer"
> "user_agent"
>
> NCSA Combined Log Format, W3CA format or the combined plus cookies are
> the most useful.
>
> So you end up with
>
> Fields
> ------
> remotehost
> rfc931
> authuser
> [date]
> "request"
> status
> bytes
> "referer"
> "user_agent"
> In-cookie
> Out-cookie
> Server (optional)

Can you send me a CustomLog entry for Apache that I can add to the configuration, that lays things out exactly as you want it?

----
Marc G. Fournier
Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
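As an illustration of what such an entry might look like, the sketch below maps each requested field to an Apache mod_log_config format directive and prints the resulting `LogFormat` line. It is not from the thread. The directives themselves are standard, but several choices here are assumptions: reading "In-cookie" as the request's `Cookie` header and "Out-cookie" as the response's `Set-Cookie` header, using `%v` for "Server", and the nickname `combined_cookies`.

```python
# (field name from the list above, Apache format directive, wrap in quotes?)
# The cookie and server mappings are assumptions, not confirmed in the thread.
FIELDS = [
    ("remotehost", "%h", False),
    ("rfc931", "%l", False),
    ("authuser", "%u", False),
    ("date", "%t", False),
    ("request", "%r", True),
    ("status", "%>s", False),
    ("bytes", "%b", False),
    ("referer", "%{Referer}i", True),
    ("user_agent", "%{User-Agent}i", True),
    ("in_cookie", "%{Cookie}i", True),
    ("out_cookie", "%{Set-Cookie}o", True),
    ("server", "%v", False),
]


def log_format(nickname: str = "combined_cookies") -> str:
    """Build a LogFormat line; quoted fields get backslash-escaped quotes."""
    parts = []
    for _name, directive, quoted in FIELDS:
        parts.append('\\"' + directive + '\\"' if quoted else directive)
    return 'LogFormat "' + " ".join(parts) + '" ' + nickname


print(log_format())
```

A `CustomLog` line naming a log file path and the same nickname would then select this layout; the path depends on the server and is deliberately left out here.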
> > Marc,
> >
> >> Sure ... there is nothing confidential in them, just let us know what
> >> you want (if you want it) ...
> >
> > Yeah, I thought I'd ask now, because presumably you don't keep the
> > weblogs indefinitely.
>
> Actually ... unless a server crashes so that we lose them, we do ...
> svr1.postgresql (developer) has stuff going back to May 2004 ...
> we used to have www.postgresql.org *way* back before all the moving around ...
> pgfoundry.org goes back to May 7th, 2004 ... I'm a pack rat :)

Hmm, IIRC the logs on all the static mirrors are going to /dev/null, for performance reasons. Specifically, the FreeBSD jails couldn't deal with the write activity. We got like 10-15 times better performance after disabling it. It got significantly better on the other machines as well (Linux mirrors), but the FreeBSD jails were the main reason we set up that policy. With logging enabled, all the servers just fell over. IIRC, this includes wwwmaster, which we also don't have logs for.

If the data is good enough, go for the pgfoundry ones. Then consider disabling the logging there as well to see if it helps with the performance issues ;-)

Or use the old logs - I s'pose to test your systems you don't need "up to date" logs?

//Magnus
On Tue, 12 Jul 2005, Magnus Hagander wrote:

> Hmm, IIRC the logs on all the static mirrors are going to /dev/null, for
> performance reasons. Specifically the FreeBSD jails couldn't deal with
> the write activity. We got like 10-15 times better performance after
> disabling it. Got significantly better on the other machines as well
> (linux mirrors), but the freebsd jails was the main reason we set up
> that policy. With logging enabled, all the servers just fell over. IIRC,
> this includes wwwmaster, which we also don't have logs for.

Ya, that's why we've started to work on moving to 5.x ... I'm getting really tired of the "unsupported version" :(

----
Marc G. Fournier
Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664