Table Design question for gurus (without going to "NoSQL")... - Mailing list pgsql-general

From Phoenix Kiula
Subject Table Design question for gurus (without going to "NoSQL")...
Date
Msg-id CAFWfU=sS4C1-FT=EyxjKAtWzqoxMHrDL7+o_unWDbUB9kNwQFg@mail.gmail.com
Whole thread Raw
Responses Re: Table Design question for gurus (without going to "NoSQL")...  (Phoenix Kiula <phoenix.kiula@gmail.com>)
Re: Table Design question for gurus (without going to "NoSQL")...  (David Johnston <polobo@yahoo.com>)
Re: Table Design question for gurus (without going to "NoSQL")...  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Re: Table Design question for gurus (without going to "NoSQL")...  (Ondrej Ivanič <ondrej.ivanic@gmail.com>)
List pgsql-general
Hi.

Want to start another thread, loosely related to the performance
problems thread I have going.

Need some DB design guidance from the gurus here.

My big table now has about 70 million rows, with the following columns:


 alias           | character varying(35)
 url             | text
 modify_date     | timestamp without time zone
 ip              | bigint


For each IP address (user of my application) I want to have a unique
URL. So I used to have a UNIQUE constraint on IP, URL. But the index
based on this became huge, as some URLs are gigantic. so I introduced
an md5 of the URL:


 url_md5             | varchar(32)


I now have two scenarios:

1. To have an index (unique?) on "(ip, url_md5)"

2. To not have an index on just the "ip". This way a query that tries
to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
still look only at the ip bit of the index, then refine it with the
url_md5.

The good thing about #2 is the size of index remains very small with
only a bigint field (ip) being indexed.

The bad thing about #2 is that each query of "...WHERE ip = 999 AND
url_md5 = '<md5 here>'..."  will have to refine the indexed IP. If one
IP address has put in a lot of URLs, then this becomes a bit slow. As
is now happening, where I have users who have over 1 million URLs
each!

Questions:

1. Instead of md5, is there any integer hashing algorithm that will
allow me to have a bigint column and save a lot hopefully in both
storage space and speed?  (Some very useful points mentioned here:
http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
)

2. If I do go with the above scenario #1 of a joint index, is there
any way I can save space and maintain speed? Partitioning etc are out
of the question.

With a growing web database, I am sure many people face this
situation. Are nosql type databases the only sane solution to such
massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

Many thanks for any ideas or pointers!

pgsql-general by date:

Previous
From: Phoenix Kiula
Date:
Subject: Re: Huge number of INSERTs
Next
From: Rob Sargentg
Date:
Subject: 9.1.1 build failure : postgres link fails