In this blog, I look at some of the theoretical concepts behind Data Warehousing and how Data Warehousing helps in achieving Business Intelligence.
Data Warehouse (DW)
Data Warehouse (DW)
- Is a big store which stores ALL of your data.
- By All data it means that it can have anything ranging from DBs to Excel sheets to Outlook servers storing emails.
- DW and production DB are two different things. While production DB is ideally suited to support production apps, DW is suited for -
- making reporting easier
- combining that data across multiple systems
- reducing production load as production environments are not responsible to produce outputs such as reports and BI
- providing long term storage of data.
- Production DBs are not created for this purpose. Usually they follow the model of on-line transaction processing (OLTP). They are designed to be worked on single records with optimized and normalized data. This means, that duplications have been removed. So getting all data for a record might involve lots of joins. Hence they are not suitable for reporting. The other problem is doing such operations on production DBs can slow the production system down.
- How DW solves this problem?
- DW uses a different mechanism than OLTP. It uses OLAP - online analytical processing. In OLAP, number of tables are reduced thus reducing the number of joins and increasing the simplicity to retrieve that data.
- In DW we actually denormalize the data so that the structure is simple to work with. This results in duplication of data, however, in recent times the cost of data storage has gone down and that additional cost is worth more when it comes to the gains in reporting.
- When we denormalize the tables, we use a concept of surrogate key. This is a primary key to this new table which doesn't really have a meaning.
- Fact: Fact tables are used to mark some events with respect to time. Facts joins dimensions. Facts also answer the question of "how much". For example, Fact table will contain the answer to who did the transaction, what kind of transaction was done, where it was done and when it was done (Just the ids for this "w" questions). Fact table might also contain the amount ("how much") of that transaction.
- Dimension:
- Dimension contains the actual values of who, what, where and when of the event. They hold the values that describe the facts. For example, who might point to Customer table, where might point to Country table, what might point to some lookup table and when might point to some history table. There are various types of dimensions -
- Static Data: This is the data that never changes. For example WeekDays. A Sunday will always be Sunday. Such type of data is usually called a type 0 dimension.
- A type 1 dimension is that kind of data in which case history is not required to be saved. Its simply overwritten.
- A type 2 dimension is that kind of data in which case history is required to be saved. In this case we usually add 2 columns to the dimension table which contain the information of "from date" and "thru date". Initially, thru date is null, indicating that this record is current. When a new update record is inserted, we change the thru date to some date for the old record. Now the thru date of the new record becomes null, indicating that this is the current record.
- A type 4 dimension is that kind of data in which you want to keep the history and values separate.
- There might be cases where within 1 table you might have few different columns that have different type dimensions.
- Usually, developers should never decide the type dimension of a column. It should always be decided by business as they know the data better.
- A conformed dimension is a table which consolidates primary keys. For example, if we have 3 different systems each assigning different key to the same item, we first consolidate these 3 keys in a conformed dimension table. This is also a case where surrogate keys become helpful.
Business Intelligence (BI)
- Transforming all the data that is there in the data warehouse into something meaningful is business intelligence.
- One main example is to aggregate the data to produce reports which requires writing tools which can sift through large amounts of data.
- Another example would be to analyze the data by looking for patterns and trends.
- We might also want to do data mining on our data to come up with analysis that might not be that obvious from the data itself. For example, a customer buying a tennis racket is 95% likely to buy tennis balls, however, he is only 2% likely to buy a tennis net.
Constructing a DW/BI solution
- Constructing a DW/BI solution involves 4 steps. Also, in the braces you see some of the MS tools that help us in doing that-
- Creating a DW
- ETL - Extract - Transform - Load (SQL Service Integration Services SSIS)
- In ETL, Extraction means extracting data from the data stores
- Transformation means transforming the data (denormalization etc as mentioned above) to match the table design in DW
- Load means loading the data in DW
- Analytics - Aggregation - Trending - Correlations (SQL Server Analysis Services)
- Reporting (SQL Server Reporting Services)