- Hybrid SCDs are a combination of both SCD 1 and SCD 2.
- It may happen that in a table, some columns are important and we need to track changes for them i.e., capture the historical data for them whereas in some columns even if the data changes, we do not have to bother.
- For such tables, we implement Hybrid SCDs, where in some columns are Type 1 and some are Type 2.
53.What is a mini dimension?
Mini dimensions can be used to handle rapidly changing dimension scenario. If a dimension has a huge number of rapidly changing attributes it is better to separate those attributes in different table called mini dimension. This is done because if the main dimension table is designed as SCD type 2, the table will soon outgrow in size and create performance issues. It is better to segregate the rapidly changing members in different table thereby keeping the main dimension table small and performing.
55.Whether Dimension table can have numeric value?
Yes, dimension table can have numeric value as they are the descriptive elements of our business.
56.What is a core dimension?
Core dimension is nothing but a Dimension table which is used as dedicated for single fact table or datamart
57.How are the Dimension tables designed?
- Most dimension tables are designed upto 2NF.
- In some instances they are further normalized to 3NF.
- Find where data for this dimension are located.
- Figure out how to extract this data.
- Determine how to maintain changes to this dimension
- certain data may not be appropriate to store in the schema, this data (or attributes) can be stored in a junk dimension.
- The nature of data of junk dimension is usually Boolean or flag values.
- A single dimension is formed by lumping a number of small dimensions.
- Junk dimension has unrelated attributes.
- A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.)
- so that those can be removed from other tables and can be junked into an abstract dimension table.
- These junk dimension attributes might not be related. The only purpose of this table is to store all the combinations of the dimensional attributes which you could not fit into the different dimension tables otherwise.
- Junk dimensions are often used to implement Rapidly Changing Dimensions in data warehouse.
- For example, it can be member eligibility flag set as ‘Y’ or ‘N’ or any other indicator set as true/false, any specific comments, etc. if we keep all such indicator attributes in the fact table then its size gets increased. So, we combine all such attributes and put in a single dimension table called as junk dimension having unique junk IDs with a possible combination of all the indicator values.
- It is a dimension table comprising of attributes that don’t have a place in the fact table or in any of the current dimension tables. Generally, these are the properties like flags or indicators.
SCD (Slowly changing dimensions), are the dimensions in which the data changes slowly, rather than changing regularly on a time basis.
Three types of SCDs are used in data warehousing, which are defined as:
- – SCD1: It is a record that is used to replace the original record even there is only one record existing in the database. The current data will be replaced and the new data will take its place.
- – SCD2: It is the new record file that is added to the dimension table. This record exists in the database with the current data and previous data that is stored in the history.
- – SCD3: This uses the original data that is modified to the new data. This consists of two records: one record that exist in the database and another record that will replace the old database record with the new information.
- Time dimensions are usually loaded by a program that loops through all possible dates appearing in the data.
- It is not unusual for 100 years to be represented in a time dimension, with one row per day.
61.What are the Different methods of loading Dimension tables?
- Conventional Load:Before loading the data, all the Table constraints will be checked against the data.
- Direct load:(Faster Loading)All the Constraints will be disabled. Data will be loaded directly.Later the data will be checked against the table constraints and the bad data won’t be indexed.
62.can a dimension table contains numeric values?.
- Yes.But those datatype will be char (only the values can numeric/char).
- these are descriptive elements of ur business
63. What is degenerate dimension table?
- If a table contains the values, which r neither dimension nor measures is called degenerate dimensions.
- Ex : invoice id,empno
64.What is VLDB?
- The perception of what constitutes a VLDB continues to grow.
- A one terabyte database would normally be considered to be a VLDB.
65. What is a dimension table?
- A dimensional table is a collection of hierarchies and categories along which the user can drill down and drill up.
- it contains only the textual attributes.
66.What are the different types of dimension?
In a data warehouse model, dimension can be of following types,
- Conformed Dimension
- Junk Dimension
- Degenerated Dimension
- Role Playing Dimension
Based on how frequently the data inside a dimension changes, we can further classify dimension as
- Unchanging or static dimension (UCD)
- Slowly changing dimension (SCD)
- Rapidly changing Dimension (RCD)
67.What is a ‘Conformed Dimension’?
- A conformed dimension is the dimension that is shared across multiple subject area. Consider ‘Customer’ dimension.
- Both marketing and sales department may use the same customer dimension table in their reports. Similarly, a ‘Time’ or ‘Date’ dimension will be shared by different subject areas.
- Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are said to be conformed.
- For example, if subscriber dimension is connected to two fact tables – billing and claim then the subscriber dimension would be treated as conformed dimension
- It might be utilized with different fact tables in a single database or over numerous data marts/warehouses
A Dimension that is utilized as a part of different areas is called as conformed dimension. It might be utilized with different fact tables in a single database or over numerous data marts/warehouses. The Confirmed dimension is the dimension which is connected to two or more fact tables.
Real Life Example:
If Order table is connected to product fact as well as Order item fact and user needs to fetch data from both tables then Order dimension will be joined to Product as well as Order item table.
68.What is degenerated dimension?
A degenerated dimension is a dimension that is derived from fact table and does not have its own dimension table.
A dimension key, such as
- transaction number,
- receipt number,
- Invoice number
etc. does not have any more associated attributes and hence can not be designed as a dimension table.
- A degenerated dimension is a dimension which is not a fact but presents in the fact table as a primary key.
- It does not have its own dimension table.
- We can also call it as a single attribute dimension table.
But, instead of keeping it separately in a dimension table and putting an additional join, we put this attribute in the fact table directly as a key.
Since it does not have its own dimension table, it can never act a foreign key in fact table.
69.What is the number of child tables that can be created out from a single parent table?
The number of child tables that can be created that can be created out of the single parent table is equal to the number of fields/columns in the parent table that are non-keys
70.What do you understand by dimension and attribute?
- Dimensions represent qualitative data.
- For example– plan, product, class are all dimensions.
- A dimension table contains descriptive or textual attributes.
- For example, product category & product name are the attributes of product dimension
71.What is a role-playing dimension?
- Dimensions are often reused for multiple applications within the same database with different contextual meaning.
- For instance, a “Date” dimension can be used for “Date of Sale”, as well as “Date of Delivery”, or “Date of Hire”.
For example, a date dimension can be used for “Date of Claim”, “Billing date” or “Plan Term date”. So, such a dimension will be called as Role playing dimension. The primary key of Date dimension will be associated with multiple foreign keys in the fact table.
72.What is rapidly changing dimension?
- This is a dimension where data changes rapidly.
73.Describe different types of slowly changing Dimension (SCD)
A Type 0 dimension is where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in actual business situation. It just means that, even if the value of the attributes change, history is not kept and the table holds all the previous data.
For example, Subscriber’s DOB is a type-0 SCD because it will always remain the same irrespective of the time.
A type 1 dimension is where history is not maintained and the table always shows the recent data. This effectively means that such dimension table is always updated with recent data whenever there is a change, and because of this update, we lose the previous values.
For example, Subscriber’s address (where the business requires to keep the only current address of subscriber) can be a Type-1 dimension.
A type 2 dimension table tracks the historical changes by creating separate rows in the table with different surrogate keys. Consider there is a customer C1 under group G1 first and later on the customer is changed to group G2. Then there will be two separate records in dimension table like below,
|Key||Customer||Group||Start Date||End Date|
|1||C1||G1||1st Jan 2000||31st Dec 2005|
|2||C1||G2||1st Jan 2006||NULL|
Note that separate surrogate keys are generated for the two records. NULL end date in the second row denotes that the record is the current record. Also note that, instead of start and end dates, one could also keep version number column (1, 2 … etc.) to denote different versions of the record.
For example,Subscriber’s address (where the business requires to keep a record of all the previous addresses of the subscriber). In this case, multiple rows for a subscriber will be inserted in the table with his/her different addresses.
A type 3 dimension stored the history in a separate column instead of separate rows. So unlike a type 2 dimension which is vertically growing, a type 3 dimension is horizontally growing. See the example below,
|Key||Customer||Previous Group||Current Group|
This is only good when you need not store many consecutive histories and when date of change is not required to be stored.
For example, Subscriber’s address (where the business requires to keep a record of current & just one previous address). In this case, we can dissolve the ‘address’ column into two different columns – ‘current address’ and ‘previous address’.
- Type-4: In this type of dimension, the historical data is preserved in a separate table. The main dimension table holds only the current data.
- For example, the main dimension table will have only one row per subscriber holding its current address. All other previous addresses of the subscriber will be kept in the separate history table. This type of dimension is hardly ever used.
A type 6 dimension is a hybrid of type 1, 2 and 3 (1+2+3) which acts very similar to type 2, but only you add one extra column to denote which record is the current record.
|Key||Customer||Group||Start Date||End Date||Current Flag|
|1||C1||G1||1st Jan 2000||31st Dec 2005||N|
|2||C1||G2||1st Jan 2006||NULL||Y|
74.Employee health details are hidden from his employer by the health care provider. Which level of data hiding is this? Conceptual, physical or external?
This is the scenario of an external level of data hiding
75.What is the form of fact table & dimension table?
Generally, the fact table is in normalized form and dimension table is in de-normalized form
76. If a unique constraint is applied to a column then will it throw an error if you try to insert two nulls into it?
No, it will not throw any error in this case because a null value is unequal to another null value. So, more than one null will be inserted in the column without any error.
77.Incremental loading for dimensions?
In a dimensional model, we may perform incremental loading for dimension tables also. One may argue that this wont be necessary as data volume in dimension tables are not as high as the data volumes in the fact tables, hence we can simply do a full load every time.
growth in the data in dimension tables and things can get quite heavy especially if we are trying to load SCD type 2 dimensions.
Standard Method of Loading
Like before, for our purpose we will assume we have the below customer table in our source system from where we need to perform the data loading.
CustomerID CustomerName Type LastUpdatedDate 1 John Individual 22-Mar-2012 2 Ryan Individual 22-Mar-2012 3 Bakers' Corporate 23-Mar-2012
As discussed in the previous article, a typical SQL query to extract data incrementally from this source system will be like this:
SELECT t.* FROM Customer t WHERE t.lastUpdatedDate > (select nvl( max(b.loaded_until), to_date('01-01-1900', 'MM-DD-YYYY') ) from batch b where b.status = 'Success');
Here “batch” is a separate table which stores the date until which we have successfully extracted the data.
Batch_ID Loaded_Until Status 1 22-Mar-2012 Success 2 23-Mar-2012 Success
78.Which one to use: “Entry Date” / “Load Date” or “Last Update Date”?
In an incremental load methodology, we should extract the record when it is first created and after that whenever the record is updated. Therefore, we should always look for “last update date” column for extracting records. This is because, “entry date” or “load date” columns in the source systems are not enough to determine if the record is updated in the later point in time.
Often source systems maintain 2 different columns as load_date and last_update_date. When extracting data based on “last update date”, ensure that source systems always populate “last updated date” field with “load date” when the record is first created.
79.What are the benefits of incremental loading of dimension tables?
- Once we extract records incrementally based on their last update date, we can compare each record with the target based on their natural keys and determine if the record is a new record or updated record.
- However, if we do not extract incrementally (and every time extract all the records from source), then the number of records to compare against target will be much higher resulting into performance degradation.
- If we are doing incremental loading, records that do not have any change will not come – only new or updatable records will come. But if we are doing full load, everything will come irrespective of any change.
80.What is Dimension table? Explain with example.
- Dimension table is table which describes the business entities of an enterprise which describes the objects in a fact table.
- Dimension table has primary key which uniquelly identifies each dimension row.
- Dimension table is sometimes called as lookup or reference table.
- The primary key of dimension table is used to associate relationship between fact table which contains foreign key.
- Dimension tables are normally in de-normalized form because these tables are only used to analyse the data and not used to execute transactions.
The fields in a dimension table is used to complete following 3 important requirement :
- Query Constrainting
- Grouping /Filtering
- Report labeling
Real Life Example :
Consider following table which contains item information.In the following table ITEM KEY is primary key which uniquely identifies the rows in the dimension table. ITEM KEY will be present in Fact table.
|ITEM KEY||ITEM NAME||BRAND||SOLD BY||Category|
In the image i have explained which are fact and which are dimension tables. You will able to see there are four dimensions :
81.History Preserving in Dimensional Modeling
One of the important objectives while doing data modeling is, to develop a model which can capture the states of the system with respect to time. You know, nothing lasts forever! Product prices change over time, people change their addresses, marital status, employers and even their names. If you are doing data modeling for a data warehouse — where we are particularly interested about historical analysis – it is crucial that we develop some method of capturing these changes in our data model
82.Types of Changing Dimensions
There are some dimensions that do not change at all. For example, let’s say you have created a dimension table called “Gender”. Below are the structure and data of this dimension table: Gender
Slowly Changing Dimension
SCD Type 1
As mentioned above, we design a dimension as SCD type 1 when we do not want to store the history. That is, whenever some values are modified in the attributes, we just want to update the old values with the new values and we do not care about storing the previous history.
We do not store any history in SCD Type 1
Please mind, this is not same as “Unchanged Dimension” discussed in the previous article. In case of an unchanged dimension, we assume that the values of the attributes of that dimension will not change at all. On the other hand, here in case of a SCD Type 1 dimension, we assume that the values of the attributes will change slowly, however, we are not interested to store those changes. We are only interested to store the current or latest value. So every time it changes we will update the old value with new ones.
Handling SCD Type 1 Dimension in ETL Process
SCD Type 1 dimensions are loaded using “Merge” operation which is also known as “UPSERT” as an abbreviation of “Update else Insert”.
SCD Type 1 dimensions are loaded by Merge operations
In pure ANSI SQL syntax, there is a particular statement that help you achieve the UPSERT operation. It’s called “MERGE” statement
SCD Type 2
We will take our “Food” dimension table as an example here, where “Price” is a variable factor. Food
Design of SCD Type 2 Dimension
In order to design the above table as SCD Type 2, we will have to add 3 more columns in this table, “Date From”, “Date To” and “Latest Flag”. These columns are called type 2 metadata columns. See below: Food
3 metadata columns
As you can see from the design above, it is now possible to go back to any date in the history and figure out what was the value of the “Price” attribute of “Food” dimension at that point in time.
Surrogate key for SCD Type 2 dimension
Note from the above example that, each time we generate a new row in the dimension table, we also assign a new key to the record.
This is the key that flows down to the fact table in a typical Star schema design. The value of this key, that is the numbers like 1, 2, 3, …. , 7 etc. are not coming from the source systems.
Instead those numbers are just like sequential running numbers which are generated automatically at the time of inserting these records.
These numbers are unique, so as to uniquely identify each record in the table, and are called “Surrogate Key” of the table.
As obvious, multiple surrogate keys may be related to the same item, however, each key will relate to one particular state of that item in time. In the above example, keys 2, 5 and 7 are all linked to “Veggie Burger” but they represent the state of the record in 3 different time spans. It’s worth noting that there would be only one record with latest flag = “Y” among multiple records of the same item.
Alternate Design of SCD Type 2: Addition of Version Number
In this design pattern, the records with highest version will always be the latest record. If we utilize this design in our earlier example, the dimension table will look like this: Food
Off course, we can also keep the “Latest Flag” column in the above table if we wish.
Handling SCD Type 2 Dimension in ETL Process
unlike SCD Type 1, Type 2 requires you to insert new records in the table as and when any attribute changes.
This is obviously different from SCD Type 1.
Because in case of SCD Type 1, we were only updating the record. But here, we will need to update old record (e.g. changing the latest flag from “Y” to “N”, updating the “Date To”) as well as we will need to insert a new record.
Like before, we can use the “natural key” to first compare if the source record is existing in the target or not. If not, we will simply insert the record in the target with new surrogate key.
But if it already exists in the target, we will have to check if any value of the attributes has changed between source and target – if not, we can ignore the source record.
But if yes, we will have to update the existing record as “N” and insert a new record with new surrogate key.
83.Performance Considerations of SCD Type 2 Dimension
SCD type 2, by design, tend to increase the volume of the dimension tables considerably.
Think of this: Let’s say you have an “employee” dimension table which you have designed as SCD Type 2. The employee dimensions has 20 different attributes and there are 10 attributes in this table which change at least once in a year on average (e.g. employee grade, manager’s name, department, salary, band, designation etc.). This means if you have 1,000 employees in your company, at the end of just one year, you are going to get 10,000 records in this dimension table (i.e. assuming on an average 10 attributes change per year – resulting into 10 different rows in the dimension table).
As you can see, this is not a very good thing performance wise as this can considerably slow down loading of your fact table as you will require to “look up” this dimension table during your fact loading.
One may argue that, even if we have 10,000 records, we will actually have only 1,000 records with Latest_Flag = ‘Y’ and since we will only lookup records with Latest_Flag = ‘Y’, the performance will not deteriorate.
This is not entirely true. While utilizing the Latest_Flag = ‘Y’ filter may decrease the size of the lookup cache, but database will generally need to do a full table scan (FTS) to identify latest records.
Moreover, in many cases ETL developer will not be able to make use of Latest_Flag = ‘Y’ column if the transactional records do not always belong to the latest time (e.g. late arriving fact records or loading fact table at later point in time – month end load / week end load etc.).
In those cases, putting latest_flag = ‘Y’ filter will be functionally incorrect as you should determine the correct return key on the basis of “Date To”, “Date From” columns.
SCD Type 3
In Type 2 design above, we have seen that whenever the values of the attributes change, we insert new rows to the table. In case of type 3, however, we add new column to the table to store the history.
So let’s say, we have a table where we have 2 column initially – “Key” and “attribute”.
If the record 1 changes its attribute from A to D, we will add one extra column to the table to store this change.
If the record again change attribute values, we will again have to add columns to store the history of the changes
Isn’t then SCD Type 3 very cumbersome?As you can see, storing the history in terms of changing the structure of the table in this way is quite cumbersome and after the attributes are changed a few times the table will become unnecessarily big and fat and difficult to manage.But that does not mean SCD Type 3 design methodology is completely unusable. In fact, it is quite usable in a particular circumstance – where we just need to store the partial history information.
Let’s think about a special circumstance where we only need to know the “current value” and “previous value” of an attribute. That is, even though the value of that attribute may change numerous times, at any time we are only concerned about its current and previous values.
In such circumstances, we can design the table as type 3 and keep only 2 columns – “current value” and “previous value” like below.
wherein a certain calculated field in the report used to depend on the latest and previous values of the customer status.
That calculated attribute was called “Churn Indicator” (churn in telecom business generally means leaving a telephone connection) and the rule
correct value of churn indicator, you do not need to know complete history of changes of customer’s status. All you need to know is the current and previous status. In this kind of partial history scenario, SCD Type 3 design is very useful.
Note here, compared to SCD Type 2, type 3 does not increase the number of records in the table thereby easing out performance concerns.
84.Rapidly Changing Dimensions
If you design a dimension table that has a rapidly changing attribute, then your dimension table will become rapidly changing dimension.
Every month, the status of any subscriber keeps on changing multiple times based on his or her account balance thereby making the “Subscribers” dimension one rapidly changing dimension.
Implementing Rapidly changing dimension
Handling rapidly changing dimensions are tricky due to various performance implications. This article attempts to provide some methodologies on handling rapidly changing dimensions in a data warehouse.
the problem with type 2 slowly changing dimension is, with every change in the dimensional attributes it increases the number of rows in the table. If lot of changes happen in the attributes of the dimension table (that is to say that the dimension is rapidly changing), the table quickly becomes bulky causing considerable performance issues. Hence the typical solution of SCD Type 2 dimensions may not be a very good fit for rapidly changing scenarios.
The method that we are going to consider here assumes the fact that, not all the attributes of a dimension table are rapidly changing in nature.
There might be a few attributes which are changing quite often and some other attributes which seldom change. If we can separate the fast changing attributes from the slowly changing ones and move them in some other table while maintaining the slowly changing attributes in the same table, we can get rid of the issue of bulking up the dimension table.
So let’s take one example to see how it works. Let’s say CUSTOMER dimension has following columns:
While attributes like name, gender, marital status etc. do not change at all or rarely change,
let’s assume customer tier and status change every month based on customer’s buying pattern. If we decide to keep status and tier in the same SCD Type 2 Customer dimension table, we could risk filling-up the table too much too soon. Instead, we can pull out those two attributes in yet another table, which some people refer as JUNK DIMENSION. Here is how our junk dimension will look like. In this case, it will have 3 columns as shown below.
Junk Dimension Structure
The column SEGMENTATION_KEY is a surrogate key. This acts as the primary key of the table. Also since we have removed status and tier from our main dimension table, the dimension table now looks like this:
Next, we must create a linkage between the above customer dimension to our newly created JUNK dimension. Note here, we can not simply pull the primary key of the JUNK dimension (which we are calling as SEGMENTATION_KEY) into the customer dimension as foreign key. Because if we do so, then any change in JUNK dimension will require us to create a new record in Customer dimension to refer to the changed key. This would in effect again increase the data volume of the dimension table. We solve this problem by creating one more mini table in between the original customer dimension and the junk dimension. This mini dimension table acts as a bridge between them. We also put “start date” and “end date” columns in this mini table so that we can track the history. Here is how our new mini table looks like:
Mini Dimension Structure
This table does not require any surrogate key. However, one may include one “CURRENT FLAG” column in the table if required. Now the whole model looks like this:
Maintaining the Junk Dimension
If number of attributes and the number of possible distinct values per attributes (cardinality) are not very large in the Junk dimension, we can actually pre-populate the junk dimension once and for all. In our earlier example, let’s say possible values of status are only “Active” and “Inactive” and possible values of Tier are only “Platinum”, “Gold” and “Silver”. That means there can be only 3 X 2 = 6 distinct combinations of records in this table. We can pre-populate the table with these 6 records from segmentation key = 1 to 6 and assign one key to each customer based on the customers status and tier values.
How does this Junk dimension help?
Since the connection between the segmentation key and customer key is actually maintained in the mini dimension table, frequent changes in tier and status do not change the number of records in the dimension table. Whenever a customer’s status or tier attribute changes, a new row is added in the mini dimension (with START_DATE = date of change of status) signifying the current relation between the customer and the segmentation.
It’s also worth mentioning that in this schema, we can manage the original customer dimension table in SCD type 1 or Type 2 methods, but we will have to take extra care to update the mini dimension also as and when there is a change in the key in the original dimension table.