Friday, May 05, 2006

Normalize Schnormalize (aka Real-Time Data Warehousing)

Some applications need real-time reporting; ecommStats Web Analytics is one such app. ecommStats customers want to see what is happening on their website, and they want to see it now.

A little background might be in order: in general, with enterprise software, you move and transform data from your transactional database (the one involved in the day-to-day work, storing every little piece of data) into your data warehouse so that you can generate reports faster and more easily. That, in a nutshell, is Business Intelligence. It is generally a heavy process, so it runs once per day, once per week, or whatever fits the business needs, and the gap between runs must be longer than the process itself takes (if the process takes 24 hours to run, it's probably not a good idea to run it every day).

But why wait until the night time (the night time is the right time) or the weekend to move all that data into a separate database, when a few simple preemptive operations on your transactional database can give your users real-time data?

Normalization is great and all, but a hybrid approach can work wonders. The data is still normalized, but we have extra tables and extra columns that act as our data warehouse.

Let's take the Search Phrase report, for instance. It is a time-based report that shows you how many people have searched for a particular phrase or keyword. In a normalized system, you would have your search phrases in one table and the incoming requests in another (this is seriously simplified to make a point):


SearchPhrase
id
phrase


SearchRequest
id
phraseId
date
user



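In SQL, those two tables might look something like this (the column types are assumptions for the sake of the example; note that date and user are reserved words in some databases, so you may need to quote or rename them in practice):

CREATE TABLE SearchPhrase (
    id     INT PRIMARY KEY,
    phrase VARCHAR(255)
);

CREATE TABLE SearchRequest (
    id       INT PRIMARY KEY,
    phraseId INT REFERENCES SearchPhrase(id),
    date     TIMESTAMP,
    user     VARCHAR(100)
);
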
Now you could sum up all the requests that used each phrase every time someone wants the Search Phrase report, but do you want to make your users wait for that? There could be thousands of phrases and thousands of requests for each one, which quickly puts you into the millions of rows! The user might as well go have a coffee while they're waiting. And do you want your poor server to work that hard for one little report?
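
That report boils down to a join and a group-by over the whole SearchRequest table; something along these lines:

-- scans every SearchRequest row on every report run
SELECT p.phrase, COUNT(*) AS requestCount
FROM SearchPhrase p
JOIN SearchRequest r ON r.phraseId = p.id
GROUP BY p.phrase
ORDER BY requestCount DESC;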

But by adding a single column, requestCount (lastRequestDate will come into play in a moment), we can make the report almost instantaneous:

SearchPhrase
id
phrase
requestCount
lastRequestDate



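If the table already exists, the change is just a couple of ALTER statements (the types, again, are assumptions):

ALTER TABLE SearchPhrase ADD requestCount INT DEFAULT 0;
ALTER TABLE SearchPhrase ADD lastRequestDate TIMESTAMP;
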
Now when we're showing that report, we don't even have to look at the SearchRequest table, so what used to have to go through potentially millions of rows now goes through just a few thousand rows (or however many search phrases you are looking at). Obviously this example alone won't give you the time-based reporting, but an extra time bucket table, using similar techniques, can get you the results you want; one possible shape is sketched below.
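
The report query now touches only the SearchPhrase table:

SELECT phrase, requestCount, lastRequestDate
FROM SearchPhrase
ORDER BY requestCount DESC;

And one possible shape for that time bucket table is a per-phrase, per-day counter (this table is hypothetical, by way of illustration):

CREATE TABLE SearchPhraseDay (
    phraseId     INT REFERENCES SearchPhrase(id),
    day          DATE,
    requestCount INT DEFAULT 0,
    PRIMARY KEY (phraseId, day)
);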

And let's say you also wanted to show the last time someone searched for a particular phrase. You could look in the SearchRequest table, sort by date descending, and grab the first row, but that is time-, disk-, and processor-intensive if the request table is large. So simply add a date column to the SearchPhrase table, like lastRequestDate above, and update it when the request comes in. Don't be afraid to duplicate data when the performance benefits can be substantial.
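
The slow version of that lookup would be something like this (LIMIT syntax varies by database):

-- sorts the whole request history for the phrase just to read one row
SELECT date
FROM SearchRequest
WHERE phraseId = PHRASE_ID
ORDER BY date DESC
LIMIT 1;

With lastRequestDate in place, the answer is already sitting in the SearchPhrase row you were reading anyway.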

Now how much harder is this to implement in your code? It's actually very simple in most cases, usually requiring just an extra update. In the example above, when saveSearchRequest(SearchPhrase phrase) is called, it saves the SearchRequest and does a simple update on the requestCount and lastRequestDate columns:
UPDATE SearchPhrase SET requestCount = requestCount + 1, lastRequestDate = SYSDATE WHERE id = PHRASE_ID;

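So a saveSearchRequest call boils down to two statements, which you would typically wrap in a single transaction (a sketch only; the INSERT columns are assumptions, and transaction syntax varies by database):

BEGIN;
-- record the raw event in the normalized table
INSERT INTO SearchRequest (phraseId, date, user)
VALUES (PHRASE_ID, SYSDATE, SOME_USER);
-- keep the denormalized report columns current
UPDATE SearchPhrase
SET requestCount = requestCount + 1, lastRequestDate = SYSDATE
WHERE id = PHRASE_ID;
COMMIT;
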

This one extra update per request is insignificant compared to the many times that queries will be run against your tables.

In conclusion, using these hybrid techniques in your database design can reduce wait times significantly, and you can report in near real-time rather than making your users wait days for their data. The new columns and/or tables take up more space, but space is cheap and performance is not. It is truly a small price to pay compared to the huge cost of running these queries repeatedly against a purely normalized database, or of a huge non-real-time bulk process to move the data into a data warehouse.
