Re: Table Design question for gurus (without going to "NoSQL")... - Mailing list pgsql-general

From David Johnston
Subject Re: Table Design question for gurus (without going to "NoSQL")...
Date
Msg-id 7475F566-AAEF-48CF-9B13-65FC7C398C9A@yahoo.com
Whole thread Raw
In response to Table Design question for gurus (without going to "NoSQL")...  (Phoenix Kiula <phoenix.kiula@gmail.com>)
List pgsql-general
On Nov 20, 2011, at 8:33, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

> Hi.
>
> Want to start another thread, loosely related to the performance
> problems thread I have going.
>
> Need some DB design guidance from the gurus here.
>
> My big table now has about 70 million rows, with the following columns:
>
>
> alias           | character varying(35)
> url             | text
> modify_date     | timestamp without time zone
> ip              | bigint
>
>

While the schema is useful you need to provide HOW the data is being used if you want to help on finding ways to
improveperformance. 

> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on IP, URL.

Give a base URL can you encode an algorithm to generate the user-specific URL on-demand; then maybe cache that result
inthe application. 

> But the index
> based on this became huge, as some URLs are gigantic. so I introduced

What does this mean?  Are there any patterns to the URLs that you can leverage (like, say, grouping them by domain
name)? Is there a lot of overlap between users so that having a URL table with a biting PK would make a difference? 


> an md5 of the URL:
>
>
> url_md5             | varchar(32)
>
>
> I now have two scenarios:
>
> 1. To have an index (unique?) on "(ip, url_md5)"
>
> 2. To not have an index on just the "ip". This way a query that tries
> to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
> still look only at the ip bit of the index, then refine it with the
> url_md5.
>
> The good thing about #2 is the size of index remains very small with
> only a bigint field (ip) being indexed.
>
> The bad thing about #2 is that each query of "...WHERE ip = 999 AND
> url_md5 = '<md5 here>'..."  will have to refine the indexed IP. If one
> IP address has put in a lot of URLs, then this becomes a bit slow. As
> is now happening, where I have users who have over 1 million URLs
> each!

Create a additional partial index on the URL for any IP address with more than X number of records?  You smallish users
theonly need to use the IP. Index while the big ones use that PLUS their personal URL index. 

>
> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will
> allow me to have a bigint column and save a lot hopefully in both
> storage space and speed?  (Some very useful points mentioned here:
> http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
> )
>
> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc are out
> of the question.
>

If you are going to discount the feature whose implementation solves this specific problem then you are basically
askingthe list to solve your specific problem and, from my comment above, to do so without providing sufficient details
asto how your application works. 

Also, WTF do you mean by "etc".  If you are going to discount something from consideration you should be able to
exactlyspecify what it is. 

Furthermore, if you ask the question and exclude possible solutions you should explain why you cannot use them so that
peoplewill not propose other solutions that would have the same faults. 

> With a growing web database, I am sure many people face this
> situation. Are nosql type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

You would implement these before you would partition?

There are likely multiple solutions to your problem but, again, simply giving a table schema doesn't help it
determiningwhich ones are feasible. 

>
> Many thanks for any ideas or pointers!
>

The only data ignorant, and thus generally useful, PostgreSQL solution is table partitioning.

Use It.

My other questions, while an interesting thought exercise, need intimate knowledge of the data to even evaluate if they
makesense. 

So, in short, use partitions.  If you cannot, provide reasons why and then include more details about the application
anddata so that meaningful solutions have a chance to be suggested. 

David J.




pgsql-general by date:

Previous
From: Gregg Jaskiewicz
Date:
Subject: Re: Table Design question for gurus (without going to "NoSQL")...
Next
From: Tom Lane
Date:
Subject: Re: 9.1.1 build failure : postgres link fails