DIMENSIONAL MODELING
https://dwbi.org/data-modelling/dimensional-model/1-dimensional-modeling-guide
22. What is Dimensional Modelling?
- Dimensional Modelling is a design technique used by many data warehouse designers to build their data warehouses.
- In this design model, all data is stored in two types of tables: fact tables and dimension tables.
- A fact table contains the facts/measurements of the business, and the dimension tables contain the context of those measurements, i.e. the dimensions on which the facts are calculated.
- Fact tables store the different transactional measurements together with the foreign keys from the dimension tables that qualify the data.
- The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval. Its main benefits are:
- Faster data retrieval
- Better understandability
- Extensibility
- Dimensional modeling is most often used in data warehousing; star and snowflake schemas are its two common representations.
- Fact tables contain numerical values that can be aggregated and analyzed; dimensions define hierarchies and descriptions of those fact values.
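As a minimal, self-contained sketch of these ideas (all table, column, and data names here are illustrative, not taken from any particular system), the following Python script uses the standard-library sqlite3 module to build a tiny star schema and aggregate a fact across two dimensions:

    import sqlite3

    # In-memory database; any relational engine would do for this sketch.
    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    cur.executescript("""
    -- Dimension tables hold the context of the measurements.
    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,  -- surrogate key
        product_name  TEXT,
        category      TEXT                  -- hierarchy level above product
    );
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,      -- e.g. 20240101
        full_date TEXT,
        year      INTEGER
    );
    -- The fact table holds numeric measures plus foreign keys to dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        quantity    INTEGER,
        amount      REAL
    );
    """)

    cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                    [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
    cur.execute("INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024)")
    cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                    [(1, 20240101, 3, 30.0), (2, 20240101, 1, 20.0)])

    # Facts are aggregated; dimensions supply the grouping context.
    for row in cur.execute("""
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_date d    ON d.date_key    = f.date_key
        GROUP BY p.category, d.year
    """):
        print(row)   # ('Hardware', 2024, 50.0)

Note how the query aggregates only fact columns while the dimensions supply the grouping attributes.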
23. Step-by-Step Approach to Dimensional Modeling
- Step 1: Identify the dimensions
- Step 2: Identify the measures
- Step 3: Identify the attributes or properties of dimensions
- Step 4: Identify the granularity of the measures
- Step 5: History Preservation (Optional)
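Applied to a hypothetical retail "sales" subject area, the five steps might produce design notes like the following sketch (every name below is invented for illustration):

    # Step-by-step design notes for a hypothetical retail "sales" model;
    # every name below is illustrative, not prescriptive.
    model = {
        "dimensions":  ["Date", "Product", "Store", "Customer"],      # Step 1
        "measures":    ["quantity_sold", "sales_amount"],             # Step 2
        "attributes": {                                               # Step 3
            "Product": ["product_name", "brand", "category"],
            "Date":    ["full_date", "month", "quarter", "year"],
        },
        "granularity": "one fact row per product per store per day",  # Step 4
        "history":     {"Product": "SCD Type 2", "Store": "SCD Type 1"},  # Step 5
    }
    print(model["granularity"])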
Dimensional Modeling Schemas
There are three different schemas, namely: Star, Snowflake, and Conformed.
Storing hierarchical information in dimension tables: check the hierarchy section. A rough illustration follows.
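As a rough illustration of how a hierarchy such as Category > Brand > Product can be stored (all names below are invented), a star schema usually flattens the hierarchy into columns of a single dimension table, whereas a snowflake schema normalizes each level into its own table:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Star style: the whole Category > Brand > Product hierarchy is
    -- flattened into one denormalized dimension table.
    CREATE TABLE dim_product_star (
        product_key   INTEGER PRIMARY KEY,
        product_name  TEXT,
        brand_name    TEXT,
        category_name TEXT
    );

    -- Snowflake style: each hierarchy level is normalized into its own
    -- table and linked by foreign keys.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_brand (
        brand_key    INTEGER PRIMARY KEY,
        brand_name   TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE dim_product_snow (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        brand_key    INTEGER REFERENCES dim_brand(brand_key)
    );
    """)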
24. What is Data Modeling?
- Data Modeling is the database design technique used to serve different types of complex SQL queries in a DWH environment.
- The Data Modeling technique is the representation of the logical data model and the physical data model according to the business requirements.
- Data Modeling is the most important design technique used to support the users of a data warehouse.
- Data Modeling always uses two types of tables: fact tables and dimension tables.
- Fact tables contain numerical values that can be aggregated and analyzed; dimensions define hierarchies and descriptions of the fact values.
Data cleaning:
The goal is to provide accurate data to the DW.
Identify incomplete, incorrect, or inaccurate data, then correct or modify it using business rules, and provide default values for missing data.
We make sure the data is consistent before we load it into the DW.
Examples: trimming whitespace, lowercasing, replacing nulls, standardizing date formats, and correcting typing mistakes. A sketch of such rules follows.
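A minimal sketch of such cleaning rules in Python (the field names, rules, and defaults here are illustrative; real rules come from the business):

    from datetime import datetime

    def clean_record(raw: dict) -> dict:
        """Apply simple, illustrative cleaning rules before loading to the DW."""
        cleaned = {}
        # Trim whitespace and lowercase free-text fields.
        cleaned["email"] = (raw.get("email") or "").strip().lower()
        # Replace NULL/missing values with a business-defined default.
        cleaned["country"] = raw.get("country") or "UNKNOWN"
        # Standardize date formats (here: source sends DD/MM/YYYY, DW wants ISO).
        raw_date = raw.get("signup_date")
        cleaned["signup_date"] = (
            datetime.strptime(raw_date, "%d/%m/%Y").date().isoformat()
            if raw_date else None
        )
        return cleaned

    print(clean_record({"email": "  Jane@Example.COM ", "country": None,
                        "signup_date": "31/01/2024"}))
    # {'email': 'jane@example.com', 'country': 'UNKNOWN', 'signup_date': '2024-01-31'}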
25. Why is Data Modeling Important?
- The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented.
- Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end users.
- In computer science, data modeling is the process of creating a data model by applying a data model theory to create a data model instance.
- A data model theory is a formal description of data models.
- When data modelling, we are structuring and organizing data.
- These data structures are then typically implemented in a database management system.
- In addition to defining and organizing the data, data modeling will impose (implicitly or explicitly) constraints or limitations on the data placed within the structure.
- Managing large quantities of structured and unstructured data is a primary function of information systems.
26. What is data modeling? Explain what is included in data warehouse modeling?
- A data model is a conceptual representation of business requirements (logical data model) or of database objects (physical data model) required for a database, and it is very powerful in expressing and communicating the business requirements and database objects.
- The approach by which data models are created is called data modeling.
- Data modeling is the process of representing the data view graphically. The data modeling process includes the following activities:
- Designing the data warehouse database in detail
- Following all the principles and patterns of data warehousing and business intelligence
- There are several data modeling tools available that can display the data in the best graphical way.
The data warehouse modeling includes:
- Top-down driven approach
- Fact tables and dimension tables
- Multidimensional model, often called a star schema
- Normalization and denormalization
27. Explain common errors you faced while doing data modeling?
The following are common errors users face while doing data modeling:
- Building Large Data Models: Building large data models is not a good practice. If there are more than 200 tables, the data model becomes more and more complex, and in that case the data model is likely to fail.
- Unnecessary Surrogate Keys: A surrogate key should not be used unnecessarily. Use a surrogate key only when the natural key cannot serve the purpose of a primary key.
- Purpose Missing: In many situations the user does not know the purpose of the business. Without a proper understanding of the specified business there is no way to create a fitting data model for it, so the purpose of the business must be known before creating the data model.
- Inappropriate Denormalization: Don't denormalize until and unless you have a solid and clear business reason to do so, because denormalization creates redundant data which is difficult to maintain. A sketch of that redundancy follows this list.
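To make the redundancy concrete, here is an illustrative sketch (schema and column names invented): in the denormalized variant, a customer's city is copied onto every order row, so a single change of city must be updated in many places instead of one:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Normalized: the customer's city is stored exactly once.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount      REAL
    );

    -- Denormalized: customer name/city are repeated on every order row,
    -- so one change of city must be updated in many places.
    CREATE TABLE orders_denorm (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT,
        customer_city TEXT,
        amount        REAL
    );
    """)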
28. Common Mistakes in Data Modelling
Avoid Large Data Models
Building massive data models: Large data models are likely to have more design faults. Try to restrict your data model to no more than 200 tables.
Watch carefully for the following signs in your data model:
- Lots of entities with no or very few non-key attributes
- Lots of modelling objects with names which no business user would recognize
- You yourself have a lot of trouble coming up with the names of the attributes
In the early phase of data modelling, ask yourself these questions:
- Is the large size really justified?
- Is there any extraneous content that I can remove?
- Can I shift the representation and make the model more concise?
- How much work is it to develop the application for this model, and is that worthwhile?
- Is there any speculative content that I can remove? (Speculative contents are those which are not immediately required but still kept in the model as “might be required” in the future.)
Lack of Clarity or Purpose
Lack of purpose: If you do not know what your business solution is intended for, you might come up with an incorrect data model. So having clarity on the business purpose is very important to come up with the right data model.
Reckless violation of Normal Form
Traps in Dimensional Modelling
Snow-flaking between two Type-II slowly changing dimension (SCD) tables
Theoretically speaking, there is no issue with such a model, at least until one tries to create the ETL (extraction-transformation-loading) code behind these tables.
Consider this: suppose a "Product" dimension is snow-flaked out to a "ProductType" dimension and both are Type-II SCDs. If something changes in the "ProductType" table, a new row is created in "ProductType" (since ProductType is SCD2, any historical change is maintained by adding a new row). This new row will have a new surrogate key. But in the "Product" table, every existing row still points to the old product type record, leading to a data anomaly.
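A hypothetical, self-contained Python illustration of the anomaly (all keys, names, and attribute values are invented):

    # product_type rows (SCD2): surrogate key, natural key, attribute, current flag.
    product_type = [
        {"type_key": 10, "type_id": "PT1", "tax_class": "standard", "current": False},
        {"type_key": 11, "type_id": "PT1", "tax_class": "reduced",  "current": True},
    ]
    # The product row still references the OLD surrogate key of PT1:
    product = [
        {"product_key": 1, "product_id": "P1", "type_key": 10, "current": True},
    ]

    # Joining product to product_type through the stored surrogate key
    # silently returns the outdated attribute value:
    for p in product:
        pt = next(t for t in product_type if t["type_key"] == p["type_key"])
        print(p["product_id"], "->", pt["tax_class"])   # P1 -> standard (stale!)

The stale surrogate-key reference is exactly what the ETL code would have to detect and repair, which is why such snow-flaked SCD2 chains are best avoided.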
Indiscriminate use of Surrogate keys
A surrogate key should not be used unnecessarily. Use a surrogate key only when the natural key cannot serve the purpose of a primary key.
Surrogate keys are used as unique identifiers to represent an entity in the modeled world. They are required when we cannot use a natural key to uniquely identify a record, or when using a surrogate key is deemed more suitable because the natural key is not a good fit for a primary key (the natural key is too long, its data type is not suitable for indexing, etc.).
But surrogate keys also come with some disadvantages. The values of surrogate keys have no relationship with the real-world meaning of the data held in a row. Therefore, overuse of surrogate keys (often in the name of "standardization") leads to the problem of disassociation and creates unnecessary ETL burden and performance degradation.
Even query optimization becomes difficult when one disassociates the surrogate key from the natural key. The reason: since the surrogate key takes the place of the primary key, the unique index is applied to that column, and any query based on the natural key identifier leads to a full table scan, as such a query cannot take advantage of the unique index on the surrogate key.
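This effect is easy to demonstrate with SQLite's EXPLAIN QUERY PLAN (a sketch; the exact plan wording varies by engine and version, and the table is invented):

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("""
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY,  -- surrogate key, uniquely indexed
            customer_no  TEXT,                 -- natural key, NOT indexed here
            name         TEXT
        )
    """)

    # Filtering on the natural key cannot use the surrogate key's index:
    plan = cur.execute("EXPLAIN QUERY PLAN "
                       "SELECT * FROM dim_customer WHERE customer_no = 'C042'")
    print(plan.fetchall())   # typically reports a full SCAN of dim_customer

    # A secondary index on the natural key restores an index SEARCH:
    cur.execute("CREATE INDEX ix_customer_no ON dim_customer (customer_no)")
    plan = cur.execute("EXPLAIN QUERY PLAN "
                       "SELECT * FROM dim_customer WHERE customer_no = 'C042'")
    print(plan.fetchall())   # typically reports a SEARCH using ix_customer_no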
Before assigning a surrogate key to a table, ask yourself these questions:
- Am I using a surrogate key only for the sake of maintaining a standard?
- Is there any unique, not-null natural key that I can use as the primary key instead of a new surrogate key?
- Can I use my natural key as the primary key without degrading performance?
If the answers to the above questions are 'YES', do yourself a favor: don't use the surrogate key.
Unnecessary de-normalization: Don't denormalize until and unless you have a solid and clear business reason to do so, because de-normalization creates redundant data which is difficult to maintain.
29. What is OLTP data modeling?
- OLTP stands for Online Transaction Processing.
- The approach by which data models are constructed for transactions is called OLTP data modeling. Examples: all online transactions, bank transactions, trading transactions.
30. Performance Considerations for Dimensional Modeling
Surrogate Key is not mandatory
Keep it simple – nothing succeeds like simplicity.
Ask yourself whether you really require the plethora of attributes that you have kept in your dimension table, or whether you really need that additional table you are thinking of introducing into your model. Every addition costs you:
- Additional programming effort – mapping effort from source to staging and from staging to the data warehouse
- Additional testing effort – one additional test case to be written and executed by the testers to check the correctness of the data
- Additional space requirement – if your fact table is going to contain 100 million records in the next 2-3 years (not a big number by any standard), every extra key column costs real space; as a rough, illustrative figure, a 4-byte integer surrogate key column alone adds about 400 MB across 100 million rows, before indexes.
- Additional load effort – introducing surrogate keys in dimension tables necessitates an additional lookup operation when loading the fact data, and a lookup is a very costly affair, if not the most costly one (see the sketch below).
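As a hedged sketch of that extra lookup step (names and keys invented): every incoming fact row must have its natural key translated to the dimension's surrogate key before it can be inserted:

    # Illustrative lookup step during a fact load: each incoming natural key
    # must be translated to the dimension's surrogate key before insertion.
    dim_product = {"P1": 101, "P2": 102}   # natural key -> surrogate key

    incoming_facts = [
        {"product_id": "P1", "amount": 30.0},
        {"product_id": "P2", "amount": 20.0},
    ]

    fact_rows = []
    for rec in incoming_facts:
        # This lookup is the extra work that surrogate keys impose on every
        # fact row; at scale it is done with hash lookups or database joins.
        surrogate = dim_product[rec["product_id"]]
        fact_rows.append((surrogate, rec["amount"]))

    print(fact_rows)   # [(101, 30.0), (102, 20.0)]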
Other ways to keep the model simple:
- Reduce complexity
- Reduce pre-calculated values
- Remove unnecessary constraints
- Stop snow-flaking
- Choose the attributes of SCD Type 2 dimensions judiciously
In a typical implementation, SCD Type 2 tables preserve history by adding a new row to the table whenever a change happens. Because of this property, SCD Type 2 tables tend to grow larger and larger day by day, degrading performance. This is a problem because growth of dimension tables hurts performance even more than growth of the fact table, as dimension tables are typically conformed and used across different schemas. In order to control this growth, it is very important that we include in the SCD implementation only those columns whose history we really want to preserve.
This means that if you have one hundred columns/entities in the source system for a specific table, don't start tracking SCD changes for all of them. Rather, carefully consider the list of columns whose history you really need to preserve – track only those as Type 2 in your SCD implementation and treat the other columns as Type 1 (so that even if they change, no new row is added to the table), as sketched below.
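A minimal Python sketch of such a selective policy (column names and row shapes are invented for illustration): a change to a Type 2 column closes the current row and appends a new versioned row, while a change to a Type 1 column is overwritten in place and the table does not grow:

    # Hypothetical SCD policy: track history only where it matters.
    TYPE2_COLUMNS = {"credit_rating"}          # history preserved (new row)
    TYPE1_COLUMNS = {"phone_number", "email"}  # overwritten in place (no growth)

    def apply_change(history: list, change: dict, next_key: int) -> int:
        """Apply one incoming change to a customer's row history."""
        current = history[-1]
        if any(col in TYPE2_COLUMNS and change[col] != current[col] for col in change):
            # Type 2: close the current row and append a new versioned row.
            current["is_current"] = False
            new_row = {**current, **change,
                       "customer_key": next_key, "is_current": True}
            history.append(new_row)
            return next_key + 1
        # Type 1 only: overwrite attributes on the current row, no new row.
        current.update(change)
        return next_key

    history = [{"customer_key": 1, "credit_rating": "A",
                "phone_number": "555-0100", "is_current": True}]

    next_key = apply_change(history, {"phone_number": "555-0199"}, next_key=2)
    print(len(history))   # 1  -- Type 1 change, table did not grow

    next_key = apply_change(history, {"credit_rating": "B"}, next_key=next_key)
    print(len(history))   # 2  -- Type 2 change, history row added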
- Don’t Create a Snapshot Fact and an SCD Type 2 table for the same purpose
This is admittedly a little complex to explain. A snapshot fact always shows the latest (or last known) state of the measures. The latest records of an SCD Type 2 dimension do the same; the only difference is that a fact shows the state of the measures whereas an SCD Type 2 table shows the state of the attributes. But there are cases where this difference becomes very blurry.
Where do you store attributes such as the "number of telephone lines" or the "number of bank accounts" of one customer? Is the number of telephone lines really an attribute of the customer itself? Not quite. So we won't store it in the dimension table; we need to store it in the fact table. But here comes the problem: all such records in the fact table get different surrogate keys for the same customer, since the key comes from an SCD Type 2 table. Because of this key variance, it is impossible to join two such facts in a single query if those facts were not loaded at the same time. To preserve key invariance, the dimension table needs to be self-joined twice in the middle, and this causes a lot of performance issues.
- Consider Indexing Carefully
- Consider Partitioning
- Avoid Fact to Fact Join
- Give some attention to your server infrastructure
Dimensional modeling (DM) is the name of a set of techniques and concepts used in data warehouse design. According to data warehousing consultant Ralph Kimball, DM is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance. Dimensional modeling always uses the concepts of facts (measures) and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.