Re: Physical sites handling large data - Mailing list pgsql-general
From: Shridhar Daithankar
Subject: Re: Physical sites handling large data
Msg-id: 3D84AC69.1006.13BDF7AC@localhost
In response to: Re: Physical sites handling large data (Justin Clift <justin@postgresql.org>)
Responses: Re: Physical sites handling large data
List: pgsql-general
On 14 Sep 2002 at 9:39, Justin Clift wrote:
> Hi Scott,
>
> Good move. :)
>
> Shridhar, any idea of the kind of demands they'll be placing on the
> database?

First of all, thank you all for your quick and helpful responses. Robert E. Bruccoleri sent me a description of his site, which gave me an idea of what PostgreSQL can do at that scale. I spent more than three hours looking for URLs of such large installations and didn't find a single one via Google (maybe I am a bad searcher..).

From what Robert E. Bruccoleri tells me, there is a bigger installation of Postgres than is mentioned anywhere else. I would request people to put up some information if they have such an installation; I would do the same, subject to permission from the client. Postgres deserves this sort of publicity. Such information is crucial when it comes to convincing government bodies to consider open source alternatives to commercial ones, through LUGs etc.

I understand that designing a database for such a site would require detailed knowledge of the application. I am pressing my friend/colleague to get as much information out of it as possible. Now to answer your queries..

> For example, does your friend have an idea of:
>
> a) how many clients will be simultaneously connecting to the database(s)
> normally, and at peak times

AFAIK, this job is like analysis of log data (telecom domain). So the number of clients would not be big unless some parallelism can be extracted at the application level. Most importantly, the number of clients will not fluctuate much, so it gives a somewhat deterministic level of prediction, so to say.

> b) how sensitive to performance lag and downtime is the application?

Hardly. This is a replica of a production system, so data loss is not going to be an issue. Of course, if the database pulls wrong data and hence produces wrong results, that is obviously unacceptable, but I guess things won't come down to that level. There are no set performance goals as of now. Obviously faster is better..
> c) what are the data integrity requirements? Large array's of data that
> are mission critical need to be treated differently than small arrays,
> especially when taking b) into consideration. Higher end non-intel
> servers generally have better features in their OS and hardware for
> dealing with large amounts of important data.

IMO, if I have to write an evaluation and proposal for this task, Intel hardware won't feature, simply because it does not have a proven 64-bit CPU. Personally, knowing the client's profile, I would recommend HP hardware.

> d) what kind of stuff is your friend familar with? For example, is he
> ok with unix in general, etc?

That's OK. These guys are into HP-UX heavily; that's not a problem. Postgres has figured on their radar as a way to save licensing costs towards an Oracle installation.

> "scott.marlowe" wrote:
> > Often times the real limiter for database performance is IO bandwidth and
> > subsystem, not the CPUs. After that memory access speed and bandwidth are
> > very important too, so I can see a big HP UX box beating the pants off of
> > a Xeon.

Perfectly agreed. Slightly OT here: in the test I referred to in my OP, the Xeon machine had mirrored SCSI RAID and a 533MHz FSB. The HP-UX box was a single PA-8700/750MHz CPU with a single SCSI disk. IMO even the bandwidth considerations were in the Xeon's favour. The only things in favour of the HP box were the 3MB on-chip cache and the 64-bit RISC CPU. If that makes so much of a difference, Intel machines won't stand a chance for a long time, IMO.

> > My point being, spend more money on the drive subsystem than anything else
> > and you'll probably be fine, but postgresql may or may not be your best
> > answer. It may be better to use something like berkeley db to handle this
> > job than a SQL database.

Agreed, but with a few differences... I am going to divide my pitch as follows.

1) Can PostgreSQL do the job?

Yes. Refer to the FAQ for the limits of PostgreSQL, and to real-world installations such as the one Robert E. Bruccoleri described in person, as examples.

2) Why PostgreSQL?

a) It can do the job, as illustrated by 1.

b) The architecture of Postgres. Some strong points:

i) It is the most complete and useful open source SQL implementation. If they want, they can customise it later. Using Berkeley DB might do well at this level, but given the complexity of the application (which I don't know much about), I would rather not propose it.

ii) It does not offer features that would limit implementations. E.g., since it relies on the OS to take care of storage, it will run with the same ease on an IDE disk or a fibre array (or whatever highest-end tech is available). One need not wait for a storage driver from the database vendor. Fewer unneeded features == cleaner implementation.

iii) Since a table is split into segments of 1GB each, there is no upper limit on table size, and segments can be spread across storage devices etc. (by hand-tweaking, basically).

Some weak points (these guys are considering distributed databases, but won't mind spending on hardware if the proposal is worth it):

i) A database cannot span multiple machines, so clustering is out. If data is split into multiple databases on multiple machines, the application will have to do the merging etc. Any pointers on this? Partitioning etc.?

ii) No *out of the box* replication. (OK, I could drop this point, but when MySQL pops up I have to include it for a fair comparison.)

iii) Being a process-driven architecture, it cannot process data in parallel even where that would be possible. E.g., say a table is 100GB in size and thus split across 100 segments on the file system. It cannot return data from all 100 segments simultaneously, because there is one process per connection. It also won't be able to spread any computational load across multiple CPUs.

3) How to do it?

a) Get a 64-bit architecture. Depending on computational requirements and the projected number of connections, add CPUs. Start with one CPU; I guess that would be enough, given how good PA-RISC CPUs are.
(OK, I haven't used anything else, but PA-RISC CPUs look good to me for data manipulation.)

b) Get loads of RAM. 4-8GB sounds good to me, maybe even more. On this topic: say I decide to devote 8GB to Postgres; that means 1048576 buffers at an 8K page size. Do I need to push the available shared memory beyond this value? Can I tweak the page size, and if so, would it help for such an installation?

c) Get *fast* storage. I would rather not elaborate on this point, because the client is supposed to know better, being in the telecom business, but I would offer my inputs based on information from Scott etc. I don't think it will come to that, though.

Any suggestions? Modifications? This is a rough plot; I will finalise it this evening/tomorrow morning. (I am from India, BTW, just to let you know my time zone.. ;-)) I hope I haven't missed any point I have thought of in the last two days.

Once again, thanks a lot for the help offered. I cannot put into words how on-point this thread has been.

Bye
Shridhar

--
DeVries' Dilemma: If you hit two keys on the typewriter, the one you don't want hits the paper.
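The buffer arithmetic under b) above can be checked quickly. A minimal sketch (the helper name is made up, and whether a non-default page size is advisable is a separate question; Postgres's page size is a compile-time setting):

```python
# How many buffers it takes to devote a given amount of RAM to
# Postgres at a given page size (8K is the default).
def buffers_for(ram_bytes: int, page_size: int = 8 * 1024) -> int:
    return ram_bytes // page_size

GB = 1024 ** 3
print(buffers_for(8 * GB))             # 8GB at the default 8K page
print(buffers_for(8 * GB, 32 * 1024))  # same 8GB at a 32K page
```

At the default 8K page, 8GB indeed works out to the 1048576 buffers mentioned above; a larger page size reduces the buffer count proportionally.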