MODELS
ERD
DIMENSIONAL MODELLING
STAR vs SNOWFLAKE
FACTS
DIMENSION
MEASURES
HIERARCHY
RELATIONSHIP
METADATA
KEY
MODELS
1.What is Enterprise Data Modeling?
- A common, consistent view of entities and their relationships across the enterprise.
- Gives access to information scattered throughout an enterprise across
- different divisions,
- departments,
- different databases and
- data models.
- Comprises all entities required by an enterprise,
- split up into subject areas.
- Also called the global business model; the entire information about the enterprise is captured in the form of entities.
- When an enterprise logical data model is transformed to a physical data model, supertypes and subtypes may not carry over as-is, i.e. the logical and physical structures of supertypes and subtypes may be entirely different. A data modeler has to change them according to the physical and reporting requirements.
- When an enterprise logical data model is transformed to a physical data model, the lengths of table names, column names etc. may exceed the database maximum; manually edit the physical names according to the database or organization’s standards.
- Requires standardization of the data model:
- conformed dimensions to connect to the same attribute present in several tables;
- standard abbreviations, kept consistent across the data model.
2. What is Data Modeling?
- A data model is a conceptual representation of business requirements (logical data model) or database objects (physical data model) required for a database, and is very powerful in expressing and communicating the business requirements and database objects.
- The process of creating a data model by applying a data model theory to create a data model instance.
- A database design technique useful for serving different types of complex SQL queries in a DWH environment.
- Representation of the logical data model and physical data model according to the business requirements.
- There are two types of tables: fact tables and dimension tables.
- Fact tables -> numerical values which can be aggregated and analyzed on fact values.
- Dimensions define hierarchies and descriptions of fact values.
- Data modelling is the diagrammatic representation showing how the entities are related to each other.
- Generally, data models are created in the data analysis & design phase of the software development life cycle.
The data warehouse modeling includes:
- Top-down driven approach
- Fact tables and dimensions tables
- Multidimensional model or often called star schema
- Normalization and denormalization
3.Why is Data Modeling Important?
- Because a data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end-users.
- It structures and organizes data
- so it can be implemented in a database management system.
- It will impose (implicitly or explicitly) constraints or limitations on the data placed within the structure.
- It ensures data objects are completely and accurately represented.
- Managing large quantities of structured and unstructured data is a primary function of information systems.
4.Steps to create a Data Model?
- Some of the steps below may be excluded, or additional ones included, depending on the project.
- Sometimes a data model has to be developed based on an existing database;
- in that case one has to reverse engineer the database and create the data model.
- Get Business requirements.
- Create High Level Conceptual Data Model.
- Create Logical Data Model.
- Select target DBMS where data modeling tool creates the physical schema.
- Create standard abbreviation document according to business standard.
- Create domain.
- Create Entity and add definitions.
- Create attribute and add definitions.
- Based on the analysis, try to create surrogate keys, super types and sub types.
- Assign datatype to attribute. If a domain is already present then the attribute should be attached to the domain.
- Create primary or unique keys to attribute.
- Create check constraint or default to attribute.
- Create unique index or bitmap index to attribute.
- Create foreign key relationship between entities.
- Create Physical Data Model.
- Add database properties to physical data model.
- Create SQL Scripts from Physical Data Model and forward that to DBA.
- Maintain Logical & Physical Data Model.
- For each release (version of the data model), create a change log document for the differences between the current version and the previous version of the data model.
5. What is forward engineering in a data model?
- DDL scripts are generated from the data model.
- Data modeling tools have options to create DDL scripts by connecting with various databases. With these scripts, databases can be created.
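For illustration, a minimal sketch of what a generated DDL script might look like (the table and column names are hypothetical):

```sql
-- Hypothetical DDL a modeling tool might generate from the physical model
CREATE TABLE CUSTOMER (
    CUSTOMER_ID INTEGER      NOT NULL,
    CUSTOMER_NM VARCHAR(100),
    CREATED_DT  DATE,
    CONSTRAINT CUSTOMER_PK PRIMARY KEY (CUSTOMER_ID)
);
```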
6. What is reverse engineering in a data model?
- creating the data models from database or scripts.
- Data modeling tools have options to connect to the database by which we can reverse engineer a database into a data model.
7.What is Data Modeling Development Cycle?
- Gathering Business Requirements – First Phase:
- Conceptual Data Modeling (CDM) – Second Phase:
- major entities, relationships
- Logical Data Modeling (LDM) – Third Phase:
- represents all of the business requirements of an organization.
- Physical Data Modeling (PDM) – Fourth Phase:
- required tables, columns, relationships, database properties
- Database – Fifth Phase:
- SQL code is generated from the physical data model; the SQL code is then executed on the server to create the databases.
8.Categories of data model:
Conceptual Data Model:
- Represents the way many users perceive data.
- It uses concepts such as Entities, Attributes and Relationships.
Logical Data Model:
- Entities, Attributes, Super Types, Sub Types, Primary Keys, Foreign Keys, Alternate Keys, Inversion Key Entries, Rules, Relationships, Definitions, business rules, etc.
- A representation of your database in a logical way.
- The actual implementation and extension of a conceptual data model.
Physical Data Model:
- Describes how data is stored in the computer, representing information such as record formats, record orderings and access paths.
- Primary keys, foreign keys, table names, column names and column data types, tables, columns, relationships, and database properties for the physical implementation of databases.
- A representation of the physical database.
- Database performance, indexing strategy, and physical storage are key concerns.
- The approach by which physical data models are created is called physical data modeling.
- For example, in a healthcare project the physical data model contained entities such as subscriber, member, healthcare provider, claim, bill, enrollment, group, eligibility, plan/product, commission, capitation, etc.
Representational / Implementation Data Model:
- They include Hierarchical data model, Network data model and Relational data model.
9.What does data model contain?
Logical Data Model: Entity, Attributes, Super Type, Sub Type, Primary Key, Alternate Key, Inversion Key Entry, Rule, Relationship, Definition, business rule, etc
Physical Data Model: Table, Column, Primary key Constraint, Unique Constraint or Unique Index, Non Unique Index, Check Constraint, Default Value, Foreign Key, comment etc.
A conceptual data model shows how the project looks at a high level.
10.What is Conceptual Data Modeling?
- Includes all major entities and relationships, and
- does not contain much detail.
- Conceptual data model is created by gathering business requirements from various sources like business documents, discussion with functional teams, business analysts, smart management experts and end users who do the reporting on the database.
- Data modelers create conceptual data model and forward that model to functional team for their review.
Conceptual Data Model – Highlights
- CDM is the first step in constructing a data model in top-down approach and is a clear and accurate visual representation of the business of an organization.
- CDM visualizes the overall structure of the database and provides high-level information about the subject areas or data structures of an organization.
- CDM discussion starts with main subject area of an organization and then all the major entities of each subject area are discussed in detail.
- CDM comprises entity types and relationships. The relationships between the subject areas and the relationships between the entities in a subject area are drawn using symbolic notation (IDEF1X or IE).
- In a data model, cardinality represents the relationship between two entities, i.e. a one-to-one, one-to-many, or many-to-many relationship between the entities.
- CDM contains data structures that have not been implemented in the database.
In CDM discussions, technical as well as non-technical teams project their ideas for building a sound logical data model.
The approach by which conceptual data models are created is called conceptual data modeling.
Each of the data entities has its own data attributes. For example, a data attribute of the provider will be the provider identification number; a few data attributes of the membership will be the subscriber ID and member ID; one of the data attributes of a claim will be the claim ID; each healthcare product or plan will have a unique product ID; and so on.
11.What is Logical Data Modeling?
- The approach by which the logical data model (entities, attributes, keys, relationships, definitions, business rules) is created; see the Logical Data Model items under question 8.
12.What is Physical Data Modeling?
- The approach by which the physical data model (tables, columns, data types, constraints, database properties) is created; see the Physical Data Model items under question 8.
13.Explain common errors you have faced while doing data modeling?
- Building large data models: Building large data models is not a good practice. If there are more than 200 tables, the data model becomes more and more complex; in that case the data model is likely to fail.
- Unnecessary surrogate keys: Use a surrogate key only when the natural key cannot serve the purpose of a primary key.
- Missing purpose: If the user does not have a proper understanding of the specified business, there is no way to create a correct data model for that business. So the purpose of the business must be known before creating the data model.
- Inappropriate denormalization: Don’t denormalize unless you have a solid & clear business reason to do so, because de-normalization creates redundant data which is difficult to maintain.
14.Common Mistakes in Data Modelling
Avoid Large Data Models
Building massive data models:
Large data models are likely to have more design faults. Try to restrict your data model to not more than 200 tables.
Watch carefully if you have the following signs in your data model:
- Lots of entities with no or very few non-key attributes?
- Lots of modelling objects with names which no business user would recognize?
- You yourself have lots of trouble coming up with the names of the attributes?
In the early phase of data modelling, ask yourself these questions –
- Is the large size really justified?
- Is there any extraneous content that I can remove?
- Can I shift the representation and make the model more concise?
- How much work is it to develop the application for this model, and is that worthwhile?
- Is there any speculative content that I can remove? (Speculative contents are those which are not immediately required but still kept in the model as “might be required” in the future.)
Lack of Clarity or Purpose
Lack of purpose: If you do not know what your business solution is intended for, you might come up with an incorrect data model. So having clarity on the business purpose is very important to come up with the right data model.
Reckless violation of Normal Form
Traps in Dimensional Modelling
Snow-flaking between two Type-II slowly changing dimension (SCD) tables
Theoretically speaking there is no issue with such a model, at least until one tries to create the ETL programming (extraction-transformation-loading) code behind these tables.
Consider this – in the above scenario, suppose something changed in the “ProductType” table which created a new row in the “ProductType” table (since ProductType is SCD2, any historical change is maintained by adding a new row). This new row will have a new surrogate key. But in the Product table, any existing row is still pointing to the old product type record, hence leading to a data anomaly.
Indiscriminate use of Surrogate keys
Surrogate key should not be used unnecessarily. Use surrogate key only when the natural key cannot serve the purpose of a primary key.
Surrogate keys are used as a unique identifier to represent an entity in the modeled world. Surrogate keys are required when we cannot use a natural key to uniquely identify a record, or when using a surrogate key is deemed more suitable because the natural key is not a good fit for a primary key (natural key too long, data type not suitable for indexing, etc.).
But surrogate keys also come with some disadvantages.
The values of surrogate keys have no relationship with the real-world meaning of the data held in a row. Therefore over-usage of surrogate keys (often in the name of “standardization”) leads to the problem of disassociation and creates unnecessary ETL burden and performance degradation.
The growth of dimension tables hurts the performance even more than the growth of the fact table, as dimension tables are typically conformed and used across different schemas.
Even query optimization becomes difficult when one disassociates the surrogate key with the natural key. The reason being – since surrogate key takes the place of primary key, unique index is applied on that column. And any query based on natural key identifier leads to full table scan as that query cannot take the advantage of unique index on the surrogate key.
Before assigning a surrogate key to a table, ask yourself these questions –
- Am I using a surrogate key only for the sake of maintaining standard?
- Is there any unique not null natural key that I can use as primary key instead of a new surrogate key?
- Can I use my natural key as primary key without degrading the performance?
If the answers to the above questions are ‘YES’ – do yourself a favor: don’t use the surrogate key.
Unnecessary de-normalization: Don’t denormalize until and unless you have a solid & clear business reason to do so because de-normalization creates redundant data which is difficult to maintain
16.Standardization Needs | Modeling data:
- Several data modelers may work on the different subject areas of a data model, and all data modelers should use the same naming conventions and the same style of writing definitions and business rules.
- helps in understanding the business in a better way.
- Inconsistency across column names and definition would create a chaos across the business.
- For example, when a data warehouse is designed, it may get data from several source systems and each source may have its own names, data types etc.
- These anomalies can be eliminated if a proper standardization is maintained across the organization.
- Table Names Standardization:
- Column Names Standardization:
- Database Parameters Standardization:
Table Names Standardization:
Giving a full name to the tables gives an idea of what the data is about. Generally, do not abbreviate the table names; however, this may differ according to the organization’s standards. If the table name length exceeds the database limit, then try to abbreviate the table names. Some general guidelines are listed below that may be used as a prefix or suffix for the table.
Examples:
Lookup – LKP – Used for Code, Type tables by which a fact table can be directly accessed.
e.g. Credit Card Type Lookup – CREDIT_CARD_TYPE_LKP
Fact – FCT – Used for transaction tables:
e.g. Credit Card Fact – CREDIT_CARD_FCT
Cross Reference – XREF – Tables that resolve many-to-many relationships.
e.g. Credit Card Member XREF – CREDIT_CARD_MEMBER_XREF
History – HIST – Tables that store history.
e.g. Credit Card Retired History – CREDIT_CARD_RETIRED_HIST
Statistics – STAT – Tables that store statistical information.
e.g. Credit Card Web Statistics – CREDIT_CARD_WEB_STAT
Column Names Standardization:
Some general guidelines are listed below that may be used as a prefix or suffix for the column.
Examples:
Key – KEY – System generated surrogate key.
e.g. Credit Card Key – CRDT_CARD_KEY
Identifier – ID – Character column that is used as an identifier.
e.g. Credit Card Identifier – CRDT_CARD_ID
Code – CD – Numeric or alphanumeric column that is used as an identifying attribute.
e.g. State Code – ST_CD
Description – DESC – Description for a code, identifier or a key.
e.g. State Description – ST_DESC
Indicator – IND – to denote indicator columns.
e.g. Gender Indicator – GNDR_IND
Database Parameters Standardization:
Some general guidelines are listed below that may be used for other physical parameters.
Examples:
Index – IDX – for index names.
e.g. Credit Card Fact IDX 01 – CRDT_CARD_FCT_IDX01
Primary Key – PK – for Primary key constraint names.
e.g. Credit Card Fact PK01 – CRDT_CARD_FCT_PK01
Alternate Keys – AK – for Alternate key names.
e.g. Credit Card Fact AK01 – CRDT_CARD_FCT_AK01
Foreign Keys – FK – for Foreign key constraint names.
e.g. Credit Card Fact FK01 – CRDT_CARD_FCT_FK01
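Putting these conventions together, a minimal DDL sketch (the tables and columns are hypothetical; only the naming pattern follows the standards above):

```sql
-- Lookup table with CD/DESC columns and a numbered PK constraint name
CREATE TABLE CREDIT_CARD_TYPE_LKP (
    CRDT_CARD_TYPE_CD   VARCHAR(10) NOT NULL,
    CRDT_CARD_TYPE_DESC VARCHAR(100),
    CONSTRAINT CRDT_CARD_TYPE_LKP_PK01 PRIMARY KEY (CRDT_CARD_TYPE_CD)
);

-- Fact table with a surrogate KEY, plus numbered PK/FK/IDX names
CREATE TABLE CREDIT_CARD_FCT (
    CRDT_CARD_KEY     INTEGER       NOT NULL,  -- system generated surrogate key
    CRDT_CARD_TYPE_CD VARCHAR(10)   NOT NULL,
    TXN_AMT           DECIMAL(12,2),
    CONSTRAINT CRDT_CARD_FCT_PK01 PRIMARY KEY (CRDT_CARD_KEY),
    CONSTRAINT CRDT_CARD_FCT_FK01 FOREIGN KEY (CRDT_CARD_TYPE_CD)
        REFERENCES CREDIT_CARD_TYPE_LKP (CRDT_CARD_TYPE_CD)
);

CREATE INDEX CRDT_CARD_FCT_IDX01 ON CREDIT_CARD_FCT (CRDT_CARD_TYPE_CD);
```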
17.What is the Data Modeler Role?
- A logical data modeler designs the data model to suit business requirements, creates and maintains the lookup data, compares versions of the data model, maintains the change log, and generates reports from the data model.
18.What is a data model repository?
- A repository that stores the data model and its components, e.g.:
- entity definitions,
- attribute definitions,
- columns,
- data types.
19.What is Data cleaning?
Data cleaning:
is done to provide accurate data to the DW. We
- identify
- incomplete,
- incorrect or
- inaccurate data,
then correct or modify the data using the business rules, and also provide default values for missing data.
We make sure the data is consistent before we load it into the DW.
- Like….
- trimming,
- lowercase,
- replacing null,
- date formats,
- deleting the typing mistake
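As a minimal sketch, assuming a hypothetical staging table STG_CUSTOMER with these columns, such cleaning rules can be expressed in SQL like this:

```sql
-- Hypothetical cleaning transformations applied between staging and the DW
SELECT TRIM(cust_name)                     AS cust_name,  -- trimming
       LOWER(email)                        AS email,      -- lowercase
       COALESCE(gender_cd, 'U')            AS gender_cd,  -- replacing null with a default
       TO_DATE(birth_dt_txt, 'YYYY-MM-DD') AS birth_dt    -- standardizing date formats
FROM   stg_customer;
```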
20.What particulars you would need to come up with a conceptual model in a health care domain project?
- Different categories of health care plan and products.
- Type of subscription (group or individual).
- Set of health care providers.
- Claim and billing process overview.
ERD
21.What is the difference between ER Modeling and Dimensional Modeling?
- ER modeling produces a logical and a physical model, but dimensional modeling has only a physical model.
- ER Modeling is used for normalizing the OLTP database design, whereas
- Dimensional Modeling is used for de-normalizing the ROLAP and MOLAP design.
- The Entity-Relationship (ER) model was developed as a way to unify the network and relational database views.
- Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships.
- A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects.
- The model has been extended and today it is commonly used for database design. For the database designer, the utility of the ER model is that it maps well to the relational model.
- The ER model or entity-relationship model is a particular methodology of data modeling wherein the goal of modeling is to normalize the data by reducing redundancy.
- This is different from dimensional modeling, where the main goal is to improve the data retrieval mechanism.
- Data models are tools used in analysis to describe the data requirements and assumptions in the system from a top-down perspective.
- They also set the stage for the design of databases later on in the SDLC.
There are three basic elements in ER models:
- Entities are the “things” about which we seek information.
- Attributes are the data we collect about the entities.
- Relationships provide the structure needed to draw information from multiple entities
23.What are Aggregate tables?
- Aggregate tables are tables which contain existing warehouse data that has been grouped to a certain level of the dimensions.
- It is easier to retrieve data from an aggregated table than from the original table, which has a greater number of records.
- These tables reduce the load on the database server and increase the performance of the query.
24.What is cardinality?
One to One, One to many, and many to many are different types of cardinalities.
In a database, high cardinality means more unique values are stored in a column, and vice versa.
25.What is a subtype and super type entity?
- An entity can be split into many entities (sub-entities) and grouped based on some characteristics, and each sub-entity will have attributes relevant to that entity. These entities are called subtype entities.
- The attributes which are common to these entities are moved to a super (higher) level entity, which is called as supertype entity.
For example, let’s say we have these different entities – vehicle, car, bike, economy car, family car, sports car. Here, a
- vehicle is a super-type entity;
- car and bike are its sub-type entities.
Furthermore,
- economy car, sports car, and family car are sub-type entities of their
- super-type entity – car.
- Vehicles could similarly be grouped into two-wheelers and four-wheelers, whose
- super-type entity is ‘vehicle’.
- A super-type entity is the one which is at a higher level.
- Sub-type entities are ones which are grouped together on the basis of certain characteristics.
26.What is data sparsity and how does it affect aggregation?
- Data sparsity is a term for how much data we have for a particular dimension/entity of the model.
- It affects aggregation depending on how deep the combinations of members of the sparse dimension go.
- If there are many combinations and those combinations do not have any factual data, then creating space to store those aggregations is a waste; as a result, the database becomes huge.
DIMENSIONAL MODELING
https://dwbi.org/data-modelling/dimensional-model/1-dimensional-modeling-guide
27.What is Dimensional Modelling?
- A design concept used by organizations to build their data warehouse.
- In this design model all the data is stored in two types of tables
- Fact table
- facts/measurements of the business and
- different transactional measurements
- foreign keys from dimension tables that qualify the data.
- contains numerical values which can be aggregated and analyzed.
- Fact tables are used to store various transactional measurements from “dimension tables” that qualifies the data.
- Facts are typically (but not always) numeric values that can be aggregated,
- Dimension table
- contains the context of measurements
- the dimensions on which the facts are calculated.
- Dimension defines hierarchies and description of fact values
- dimensions are groups of hierarchies and descriptors that define the facts.
- The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and faster data retrieval:
- Faster Data Retrieval
- Better Understandability
- Extensibility
- Dimensional modeling is often used in Data warehousing.
- Star, snowflake, and conformed schemas are ways of representing a dimensional model.
28.Step by Step Approach to Dimensional Modeling
- Step 1: Identify the dimensions
- Step 2: Identify the measures
- Step 3: Identify the attributes or properties of dimensions
- Step 4: Identify the granularity of the measures
- Step 5: History Preservation (Optional)
Dimensional Modeling Schema
There are 3 different schemas, namely –
- Star,
- Snowflake and
- Conformed
Storing hierarchical information in dimension tables
check hierarchy section
29. What is OLTP data modeling?
- OLTP acronym stands for ONLINE TRANSACTIONAL PROCESSING.
- The approach by which data models are constructed for transactions is called OLTP data modeling. Examples: all online transactions, bank transactions, trading transactions.
30.Performance Considerations for Dimensional Modeling
Surrogate Key is not mandatory
Keep it simple – nothing succeeds like simplicity.
Ask yourself if you really require the plethora of those attributes that you have kept in your dimension table? Or do you really need that additional table that you are thinking to introduce in your model?
- Additional Programming effort – mapping effort from source to staging and staging to data warehouse
- Additional testing effort – 1 additional test case to be written and tested by the testers to check the correctness of the data
- Additional space requirement – if your fact table is going to contain 100 million records in the next 2/3 years (which is not a big number by any standard), it is going to take significant extra space.
- This is because the introduction of surrogate keys in dimension tables necessitates an additional lookup operation when loading the fact data, and lookup is a very costly affair, if not the most costly one.
- Keep it simple – nothing succeeds like simplicity
- Reduce complexity
- Reduce pre-calculated values
- Remove unnecessary constraints
- Stop snow-flaking
- Choose the attributes of SCD Type 2 dimensions judiciously
In a typical implementation, SCD Type 2 tables preserve the history by means of adding new row to the table whenever a change happens. Because of this property, SCD type 2 tables tend to grow larger and larger day by day thereby degrading the performance. This is a problem because growth of dimension tables hurts the performance even more than the growth of fact table as dimension tables are typically conformed and used across different schema. In order to control this growth, it is very important that we consider only those columns in SCD implementation, which we really want to preserve the history of.
This means, if you have one hundred columns/entities in the source system for a specific table, don’t start tracking SCD changes for all of them. Rather, carefully consider the list of the columns where you really need to preserve the history – track only those as Type 2 in your SCD implementation and just consider the other columns as Type 1 (so that even if they change, no row is added to the table).
- Don’t Create a Snapshot Fact and a SCD Type 2 table for the same purpose
A snapshot fact always shows the latest (or last known) state of the measures. The latest records of an SCD Type 2 dimension do the same.
- A snapshot fact shows the state of the measures;
- an SCD Type 2 table shows the state of the attributes.
But there are cases where this difference can become very blurry.
- Where do you store attributes such as “No. of telephone lines” of one customer or “No. of bank accounts” of one customer? Is the number of telephone lines of a customer an attribute of the customer itself? Not quite. Then obviously we won’t store it in the dimension table; we need to store it in the fact table.
- But here comes the problem – all such records in the fact table get different surrogate keys for the same customer, since the key comes from an SCD Type 2 table.
- Because of such key variance, it is impossible to join two such facts in a single query if those facts are not loaded at the same time.
- To preserve key invariance, the dimension table needs to be self-joined twice in the middle, and this causes a lot of performance issues.
- Consider Indexing Carefully
- Consider Partitioning
- Avoid Fact to Fact Join
- Give some attention to your server infrastructure
Dimensional modeling is oriented around understandability and performance. It always uses the concepts of facts (measures) and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.
STAR vs SNOWFLAKE
31.Which schema is better – star or snowflake?
- The choice of a schema always depends upon the project requirements & scenarios.
- Since a star schema is in de-normalized form, you require fewer joins for a query. The query is simple and runs faster in a star schema.
- A star schema has a high level of redundancy and thus is difficult to maintain.
- A star schema works best when dimensions have fewer rows.
- If the purpose of your project is to do more metrics analysis, you should go with a star schema. For example, if you need to find out “what is the claim amount paid to a particular subscriber?” – go with a star schema.
- Coming to the snowflake schema, since it is in normalized form,
- it will require more joins compared to a star schema; the query will be complex and execution will be slower than in a star schema.
- A snowflake schema does not contain redundant data and thus it is easy to maintain.
- When a dimension table is big in size, a snowflake schema reduces space by normalizing it.
- If the purpose of your project is to do more dimension analysis, you should go for a snowflake schema. For example, if you need to find out “how many subscribers tied to a particular plan are currently active?” – go with a snowflake model.
- We used a snowflake schema because we had to do analysis across several dimensions and generate summary reports for the business.
- Another reason for using the snowflake schema was that it consumes less storage.
- The hierarchies of star schemas are denormalized, while the hierarchies of snowflake schemas are normalized.
- The hierarchy for a dimension is stored in the same dimension table in a star schema, but in a snowflake schema it is broken up into separate tables.

Star Schema
- The facts and all the primary keys of your dimension tables are in the fact table.
- The fact table’s primary key is the union of all its dimension table keys.
- In a star schema, dimension tables are usually not in BCNF form.
- A starflake schema is a combination of a star schema and a snowflake schema.
- The hierarchies of star schemas are denormalized, while the hierarchies of snowflake schemas are normalized.
- Starflake schemas are normalized to remove any redundancies in the dimensions.
SnowFlake
- It is almost like a star schema, but here our dimension tables are in 3rd NF, so there are more dimension tables; these dimension tables are linked by primary key–foreign key relations.
- snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape.
- The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.
33.SNOW-FLAKE SCHEMA DESIGN
Snowflake schema is just like star schema, but the difference is that here one or more dimension tables are connected to another dimension table as well as to the central fact table. See the example of a snowflake schema below.
Here we are storing the information in two dimension tables instead of one. We are storing the food type in one dimension (the “Type” table shown below) and the food in the other dimension. This is a snowflake design.
Type
KEY | TYPE_NAME |
---|---|
1 | Burger |
2 | Fries |
Food
KEY | TYPE_KEY | NAME |
---|---|---|
1 | 1 | Chicken Burger |
2 | 1 | Veggie Burger |
3 | 2 | French Fries |
4 | 2 | Twister Fries |
Snow-flaking actually increases the level of normalization in the data. This has an obvious disadvantage in terms of information retrieval, since we need to read more tables (and traverse more SQL joins) in order to get the same information. For example, if you wish to find out all the food and food types sold from store 1, the SQL queries for the star and snowflake schemas will look like the sketch below.
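A minimal sketch of those two queries: the FOOD and TYPE tables match the example above, while the SALES fact table and its STORE_KEY/FOOD_KEY columns are assumptions for illustration.

```sql
-- Star schema: the FOOD dimension is denormalized and carries TYPE_NAME itself
SELECT f.name, f.type_name
FROM   sales s
JOIN   food  f ON f.key = s.food_key
WHERE  s.store_key = 1;

-- Snowflake schema: one extra join through the normalized TYPE table
SELECT f.name, t.type_name
FROM   sales s
JOIN   food  f ON f.key = s.food_key
JOIN   type  t ON t.key = f.type_key
WHERE  s.store_key = 1;
```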
FACTS
34.Types of Fact tables
Mainly 4 types
- Transactional fact
- Periodic snapshots
- Accumulating snapshots
- Temporal snapshots
others :
- fact-less-fact
- coverage fact
- incident and snapshot facts
- conformed fact
- aggregate fact table
- cumulative fact
- centipede fact
There are four fundamental measurement events, which characterize all fact tables
Transactional
- A transactional table is the most basic and fundamental.
- The grain associated with a transactional fact table is usually specified as “one row per line in a transaction”, e.g., every line on a receipt.
- Typically a transactional fact table holds data of the most detailed level, causing it to have a great number of dimensions associated with it.
- These fact tables represent an event that occurred at an instantaneous point in time.
- A row exists in the fact table for a given customer or product only if a transaction has occurred.
- A given customer or product is likely linked to multiple rows in the fact table because the customer or product is involved in more than one transaction.
- Transaction data often is structured quite easily into a dimensional framework.
- The lowest-level data is the most natural dimensional data, supporting analyses that cannot be done on summarized data.
- Unfortunately, even with transaction-level data, there is still a whole class of urgent business questions that are impractical to answer using only transaction detail.
Periodic snapshots
- The periodic snapshot, as the name implies, takes a “picture of the moment“, where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month.
- A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.
- This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts.
- Example: Daily balances fact can be summed up through the customers’ dimension but not through the time dimension.
- Periodic snapshots are needed to see the cumulative performance of the business at regular, predictable time intervals.
- Unlike the transaction fact table, where we load a row for each event occurrence, with the periodic snapshot, we take a picture of the activity at the end of a day, week, or month, then another picture at the end of the next period, and so on.
- Example: A performance summary of a salesman over the previous month.
Accumulating snapshots
- This type of fact table is used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order.
- An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated.
- An accumulating snapshot table often has multiple date columns, each representing a milestone in the process.
- Therefore, it’s important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.
- This type of fact table is used to show the activity of a process that has a well-defined beginning and end. For example, the processing of an order.
- An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated.
- Accumulating snapshots almost always have multiple date stamps, representing the predictable major events or phases that take place during the course of a lifetime.
- Often there’s an additional date column that indicates when the snapshot row was last updated. Since many of these dates are not known when the fact row was first loaded, we must use surrogate date keys to handle undefined dates.
- In sharp contrast to the other fact table types, we revisit accumulating snapshot fact table rows to update them.
- Unlike the periodic snapshot, where we hang onto the prior snapshot, the accumulating snapshot merely reflects the accumulated status and metrics. Sometimes accumulating and periodic snapshots work in conjunction with one another.
Temporal snapshots
- By applying temporal database theory and modeling techniques, the temporal snapshot fact table allows one to have the equivalent of daily snapshots without really having daily snapshots.
- It introduces the concept of time intervals into a fact table, saving a lot of space and optimizing performance, while allowing the end user to have the logical equivalent of the “picture of the moment” they are interested in.
Centipede Fact Table
A centipede fact table is a normalized fact table. The modeller may decide to normalize the fact table instead of snow-flaking the dimension tables.
Conformed Fact Tables
They are measures re-used across multiple dimensional models,
for example, KPIs such as profit, revenue, etc.
Incident and Snapshot Facts
A fact table stores some kind of measurements which are captured against a specific time. Now it might so happen that the business is not able to capture all of its measures for every point in time.
Then those unavailable measurements can be kept empty (Null) or can be filled up with the last available measurements. The first case is the example of incident fact and the second one is the example of snapshot fact.
Cumulative Fact
This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day
Fact-less Fact Tables
A fact table that does not contain any measure is a fact-less fact table. This table will only contain keys from different dimension tables. This is often used to resolve a many-to-many cardinality issue.
For example, a fact table which has only productID and date key is a fact-less fact table.
35.What is a fact-less-fact?
- A fact table that does not contain any measure is called a fact-less fact.
- This table will only contain keys from different dimension tables.
- This is often used to resolve a many-to-many cardinality issue
Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a single teacher may have many students. To model this situation in a dimensional model, one might introduce a fact-less-fact table joining teacher and student keys. Such a fact table will then be able to answer queries like:
- Who are the students taught by a specific teacher?
- Which teacher teaches the maximum number of students?
- Which student has the highest number of teachers? etc.
Factless fact table is a fact table that contains no fact measure in it. It has only the dimension keys in it.
At times, certain situations may arise in the business where you need to have a factless fact table. For example, suppose you are maintaining an employee attendance record system; you can have a factless fact table with three keys:
- Employee_ID
- Department_ID
- Time_ID
You can see that the above table does not contain any measure. Now if you want to answer the below question, you can do so easily using the above single factless fact table rather than having two separate fact tables:
“How many employees of a particular department were present on a particular day?”
So, a factless fact table offers flexibility to the design.
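A minimal sketch of that query, assuming the factless fact table is named EMP_ATTENDANCE_FACT and joins to hypothetical DEPARTMENT_DIM and TIME_DIM tables:

```sql
-- How many employees of a particular department were present on a particular day?
SELECT COUNT(*) AS present_cnt
FROM   emp_attendance_fact f
JOIN   department_dim d ON d.department_id = f.department_id
JOIN   time_dim       t ON t.time_id       = f.time_id
WHERE  d.department_name = 'Finance'           -- hypothetical department
AND    t.calendar_date   = DATE '2017-03-22';  -- hypothetical day
```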
Event capturing fact-less fact
This type of fact table establishes the relationship among the various dimension members from various dimension tables without any measured value.
For example, a student attendance table (a student–teacher relation table) is a fact-less fact. The table gets an entry whenever a student attends a class.
Following questions can be answered by the student attendance table:
Which student is taught by the maximum number of teachers?
Which class has maximum number of attendance?
Which teacher teaches maximum number of students?
All the above queries are based on COUNT() and MAX() with GROUP BY.
Coverage table – Describing condition
This is another kind of fact-less fact. A fact-less-fact table can only answer ‘optimistic’ queries (positive queries) but cannot answer a negative query. A coverage fact is used to support negative analysis reports. For example: an electronic store did not sell any product for a given period of time.
If you consider the student–teacher relation table, the event-capturing fact table cannot answer ‘which teacher did not teach any student?’. A coverage fact attempts to answer this question by adding an extra flag: 0 for the negative condition and 1 for the positive condition.
If the student table has 20 records and the teacher table has 3 records, then the coverage fact table will store 20 * 3 = 60 records for all possible combinations. If a teacher is not teaching a particular student, then that record will have flag 0 in it.
36.What is a coverage fact?
- A fact-less-fact table can only answer ‘optimistic’ queries (positive queries) but cannot answer a negative query.
- A coverage fact table attempts to answer this – often by adding an extra flag column. Flag = 0 indicates a negative condition and flag = 1 indicates a positive condition.
Again consider the illustration in the above example. A fact-less fact containing the keys of tutors and students cannot answer a query like the ones below:
- Which teacher did not teach any student?
- Which student was not taught by any teacher?
Why not? Because the fact-less fact table only stores the positive scenarios (like a student being taught by a tutor), but if there is a student who is not being taught by any teacher, then that student’s key does not appear in this table, thereby reducing the coverage of the table.
A coverage fact table attempts to answer this – often by adding an extra flag column. Flag = 0 indicates a negative condition and flag = 1 indicates a positive condition. To understand this better, let’s consider a class where there are 100 students and 5 teachers. The coverage fact table will ideally store 100 x 5 = 500 records (all combinations), and if a certain teacher is not teaching a certain student, the corresponding flag for that record will be 0.
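As a sketch, assuming a coverage table named TEACHER_STUDENT_COVERAGE with TEACHER_KEY, STUDENT_KEY and a numeric TAUGHT_FLAG column (all hypothetical names), the negative query becomes straightforward:

```sql
-- Which teacher did not teach any student?
-- TAUGHT_FLAG is 1 for a taught combination, 0 otherwise (hypothetical schema)
SELECT c.teacher_key
FROM   teacher_student_coverage c
GROUP  BY c.teacher_key
HAVING MAX(c.taught_flag) = 0;
```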
37.What are incident and snapshot facts
- A fact table stores some kind of measurements.
- Usually these measurements are stored (or captured) against a specific time and these measurements vary with respect to time.
- Now it might so happen that the business might not be able to capture all of its measures for every point in time.
- Then those unavailable measurements can be kept empty (Null) or can be filled up with the last available measurements.
- The first case is the example of incident fact and the second one is the example of snapshot fact.
38.What is a conformed dimension?
- Conformed dimensions are dimensions which can be used across multiple data marts in combination with multiple fact tables accordingly.
- A conformed dimension is a dimension that has exactly the same meaning and content when being referred to from different fact tables.
- A conformed dimension can refer to multiple tables in multiple data marts within the same organization.
39. What is an aggregate table and an aggregate fact table? Any examples of both?
- An aggregate table contains summarized data.
- Materialized views are aggregated tables. For example, in sales we have only date-level transactions. If we want to create a report like sales by product per year, we aggregate the date values into week_agg, month_agg, quarter_agg, year_agg tables, and to retrieve data from these tables we use aggregate-aware functions.
- An aggregate table contains the [measure] values
- aggregated/grouped/summed up to some level of hierarchy.
An aggregate table contains aggregated data which can be calculated using different aggregate functions like COUNT, AVG, MIN, MAX, etc. Aggregate tables are the most widely used tables in an OLAP database. Aggregate functions are functions where the values of a table or column are grouped together to form a single value.
Following are some aggregate functions:
- 1.Average
- 2.Count
- 3.MAX
- 4.MIN
- 5.Median
- 6.SUM
Using the above aggregate functions, the data is inserted into the aggregate table. Aggregate tables are used for performance optimization, and data comes back faster when using an aggregate table. Aggregations applied at the database level improve the performance of a query, since the query does not hit the base table directly; it hits the aggregate table and fetches the data.
Real Example:
If a table contains the data of years 2016 and 2017, and the user wants an actual count of records in the table monthly, quarterly and yearly, then we need to build an aggregate table which contains the monthly count of records; for the quarterly counts we need to create another table and push the quarterly count into it. We then use those tables in the reports, so report performance improves drastically. A sketch of the monthly case follows.
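A minimal sketch of building such a monthly aggregate, assuming a base table named TXN with a TXN_DT date column (hypothetical names):

```sql
-- Hypothetical monthly aggregate table feeding the reports
CREATE TABLE txn_monthly_agg AS
SELECT EXTRACT(YEAR  FROM txn_dt) AS txn_year,
       EXTRACT(MONTH FROM txn_dt) AS txn_month,
       COUNT(*)                   AS record_cnt
FROM   txn
GROUP  BY EXTRACT(YEAR FROM txn_dt), EXTRACT(MONTH FROM txn_dt);
```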
40.What is a fact & a fact table?
Facts represent quantitative data. For example – net amount due is a fact.
- A fact table contains numerical data and foreign keys from related dimensional tables.
- Fact Table contains the measurements or metrics or facts of business process.
- Fact table
- facts/measurements of the business and
- different transactional measurements
- foreign keys from dimension tables that qualify the data.
- contains numerical values which can be aggregated and analyzed.
- Fact tables are used to store various transactional measurements from “dimension tables” that qualify the data.
- Facts are typically (but not always) numeric values that can be aggregated,
If the business process is “Sales”, then a measurement of this business process, such as “monthly sales number”, is captured in the fact table.
The fact table contains the foreign keys to the time dimension, product dimension and customer dimension, plus the measurement values. Some examples of common facts are: no. of units sold, margin, and sales revenue; the dimension tables are customer, time, product, etc., which are used to analyse the data.
There are the following 3 types of facts:
- 1.Additive:
- Measures that can be added across any dimension
- 2.Non-additive:
- Measures that cannot be added across any dimension
- 3.Semi-additive:
- Measures that can be added across some dimensions
What are additive, semi-additive and non-additive measures?
Non-additive Measures
- Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.).
- One example of non-additive fact is any kind of ratio or percentage.
- Example, 5% profit margin, revenue to asset ratio etc.
- A non-numerical data can also be a non-additive measure when that data is stored in fact tables,
- e.g. some kind of varchar flags in the fact table.
- For example, a ratio or a percentage column; a flag or an indicator column present in fact table holding values like Y/N, etc. is a non-additive measure.
Semi Additive Measures
- Semi-additive measures are those where only a subset of aggregation function can be applied.
- Let’s say account balance: a SUM() on balance does not give a useful result, but MAX() or MIN() of balance might be useful.
- Consider price rate or currency rate: SUM is meaningless on a rate; however, an average might be useful.
Additive Measures
- Additive measures can be used with any aggregation function like Sum(), Avg() etc. Example is Sales Quantity etc.
41.What are semi-additive and factless facts, and in which scenario will you use such kinds of fact tables?
- Snapshot facts are semi-additive; when we maintain aggregated facts we go for semi-additive. Ex: average daily balance.
- A fact table without numeric fact columns is called a factless fact table. Ex: promotion facts,
- used while maintaining the promotion values of a transaction (ex: product samples), because this table doesn’t contain any measures.
42.Why is the fact table in normal form?
- Basically the fact table consists of the index keys of the dimension/lookup tables and the measures.
- So whenever we have only keys and measures in a table, that itself implies that the table is in normal form.
43.What is data sparsity and how does it affect aggregation?
Data sparsity is a term used for how much data we have for a particular dimension/entity of the model.
It affects aggregation depending on how deep the combinations of members of the sparse dimension go. If there are many combinations and those combinations do not have any factual data, then creating space to store those aggregations will be a waste; as a result, the database will become huge.
44.Give us a non-computer example of preemptive and non-preemptive scheduling?
Consider any system where people use some kind of resources and compete for them. A non-computer example of preemptive scheduling is traffic on a single-lane road: if there is an emergency or an ambulance on the road, the other vehicles give way to the vehicle in need. An example of non-preemptive scheduling is people standing in a queue for tickets.
45.What is the difference between star schema and FSLDM?
A star schema has one fact table and many dimension tables to store the related data. FSLDM (Financial Services Logical Data Model) is a model proposed and designed by NCR to cater to the needs of financial sectors, especially the banking domain. It has many fact tables and dimension tables. The major fact tables include Party (customers of the bank), Product (services offered by the bank), Event (an event that occurred, e.g. an ATM transaction), Agreement (a deal between the party and the bank for a product), etc. The FSLDM can be customized to cater to a bank’s specific needs.
- FSLDM -> 3rd normal form
- Star Schema –> Denormalized for query performance
46.What are the key columns in Fact and dimension tables?
Foreign keys of dimension tables are primary keys of entity tables. Foreign keys of fact tables are the primary keys of the dimension tables.
47. Explain Granularity.
- Granularity in table represents the level of information stored in the table.
- In BI granularity is very important concept to check the table data.
- Granularity can be high or low.
- High granularity means the data carries highly detailed information; transaction-level data is high granularity data.
- Low granularity means the data carries only summarized, low-detail information.
- A fact table usually stores data at the lowest (most detailed) level, i.e. we need very low-level data in the fact table.
- The following 2 points are important in defining granularity:
- 1.Determining the dimensions that are to be included
- 2.Determining location to place hierarchy of each dimension of information
Real life example:
Date dimension granularity levels:
year, month, quarter, period, week, day
For example, in a snowflake schema the Sales table is the fact table and all the others are dimensions; the Store table is further normalized into separate City, State and Region tables.
- A fact table is usually designed at a low level of Granularity.
- This means that we need to find the lowest level of information that can be stored in a fact table. E.g. “employee performance” is a very high level of granularity; employee_performance_daily and employee_performance_weekly can be considered lower levels of granularity.
- The granularity is the lowest level of information stored in the fact table.
- The depth of data level is known as granularity.
- In date dimension, the level could be year, month, quarter, period, week, day of granularity.
48. Why are dimension tables loaded before the fact table?
- In a dimensional model, fact tables are dependent on the dimension tables.
- This means that the fact table contains foreign keys to the dimension tables.
- This is the reason dimension tables are loaded first and then the fact table.
49.what is incremental loading?
- Incremental loading means loading only the ongoing changes from the OLTP system.
50.Incremental Loading for Fact Tables
METHOD OF LOADING
- Generally speaking, incremental loading for Fact tables is relatively easier as,
- unlike dimension tables, here you do not need to perform any look-up on your target table to find out whether the source record already exists in the target or not.
- All you need to do is select the incremental records from the source (as shown below for the case of the “Sales” table) and load them as-is to the target (you may need to perform lookups to the dimension tables to assign the respective surrogate keys).
Sales Table
ID | CustomerID | ProductDescription | Qty | Revenue | Sales Date |
---|---|---|---|---|---|
1 | 1 | White sheet (A4) | 100 | 4.00 | 22-Mar-2012 |
2 | 1 | James Clip (Box) | 1 | 2.50 | 22-Mar-2012 |
3 | 2 | Whiteboard Marker | 1 | 2.00 | 22-Mar-2012 |
4 | 3 | Letter Envelop | 200 | 75.00 | 23-Mar-2012 |
5 | 1 | Paper Clip | 12 | 4.00 | 23-Mar-2012 |
SELECT t.*
FROM Sales t
WHERE t.sales_date > (SELECT NVL(MAX(b.loaded_until), TO_DATE('01-01-1900', 'MM-DD-YYYY'))
                      FROM batch b
                      WHERE b.status = 'Success');
where “batch” is a separate table maintained at target system having minimal structure and data like below
Batch_ID | Loaded_Until | Status |
---|---|---|
1 | 22-Mar-2012 | Success |
2 | 23-Mar-2012 | Success |
However, things may get pretty complicated if your fact is a special type of fact called a “snapshot fact”. Let’s understand them below.
Loading Incident Fact
Incident fact is the normal fact that we encounter mostly (and that we have seen above in our sales table example). Records in these types of facts are only loaded if there are transactions coming from the source. For example, only if a sale happens in the source system will a new sales record come in. They are dependent on some real “incident” happening in the source, hence the name incident fact.
Loading Snapshot Fact
As opposed to incident facts, snapshot facts are loaded even if there is no real business incident in the source. Consider the above example of customer and sales tables in the OLTP. Let’s say I want to build a fact that shows the total revenue of sales from each customer for each day. In effect, I want to see the below data in my fact table.
Sales fact table (This is what I want to see in my target fact table)
Date | Customer | Revenue |
---|---|---|
22-Mar-2012 | John | 6.50 |
22-Mar-2012 | Ryan | 2.00 |
23-Mar-2012 | John | 10.50 |
23-Mar-2012 | Ryan | 2.00 |
23-Mar-2012 | Bakers' | 75.00 |
Even if no sale was made to Ryan on 23-Mar, we still show him here with the old data. Similarly for John: even though goods totaling $4.00 were sold to him on 23-Mar, his record shows the cumulative total of $10.50.
Now obviously the next logical question is: how do we load this fact using incremental loading? Because incremental loading only brings in incremental data – that is, on 23rd March we will only have Bakers’ and John’s records, and those only with that day’s sales figures. We won’t have Ryan’s record in the incremental set.
Why not a full load
You can obviously opt for a full load mechanism, as that would solve this problem, but it would take a toll on your loading performance.
The solution?
One way to resolve this issue is to create 2 incremental channels of loading for the fact: one channel brings in incremental data from the source, and the other channel brings in incremental data from the target fact itself. Let’s see how that happens below. We will take the example of loading the 23-Mar data.
Channel 1: Incremental data from source
Customer | Revenue |
---|---|
John | 4.00 |
Bakers' | 75.00 |
Channel 2: Incremental data from target fact table (last day’s record)
Customer | Revenue |
---|---|
John | 6.50 |
Ryan | 2.00 |
Next we can perform a FULL OUTER JOIN between the above two sets (adding the source revenue to the last known total) to come to the below result:
Customer | Revenue |
---|---|
John | 10.50 |
Ryan | 2.00 |
Bakers' | 75.00 |
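A sketch of that merge in SQL, assuming the two channels land in hypothetical staging tables src_incr (channel 1) and tgt_prev (channel 2):

```sql
-- Merge the two incremental channels: add today's source revenue
-- to the last known cumulative total per customer
SELECT COALESCE(s.customer, t.customer)                AS customer,
       COALESCE(s.revenue, 0) + COALESCE(t.revenue, 0) AS revenue
FROM   src_incr s
FULL OUTER JOIN tgt_prev t
       ON t.customer = s.customer;
```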
DIMENSIONS
51. What is a dimension table?
- contains the context of measurements
- the dimensions on which the facts are calculated.
- dimensions are groups of hierarchies and descriptors that define the facts.
- Dimensions represent qualitative data
- For example– plan, product, class are all dimensions.
- A dimension table contains descriptive or textual attributes.
- For example, product category & product name are the attributes of product dimension
- A dimensional table is a collection of hierarchies and categories along which the user can drill down and drill up.
- It contains only the textual attributes
- which describe the business entities of an enterprise, i.e. the objects in a fact table.
- A dimension table has a primary key which uniquely identifies each dimension row.
- A dimension table is sometimes called a lookup or reference table.
- The primary key of the dimension table is used for the relationship with the fact table, which contains the corresponding foreign key.
- Dimension tables are normally in de-normalized form, because these tables are only used to analyse the data, not to execute transactions.
The fields in a dimension table are used to fulfil the following 3 important requirements:
- Query constraining
- Grouping /Filtering
- Report labeling
52.What are the different types of dimension?
In a data warehouse model, a dimension can be one of the following types:
- Conformed Dimension
- Junk Dimension
- Degenerated Dimension
- Role Playing Dimension
Based on how frequently the data inside a dimension changes, we can further classify dimension as
- Unchanging or static dimension (UCD)
- Slowly changing dimension (SCD)
- Rapidly changing Dimension (RCD)
53.What is a ‘Conformed Dimension’?
- A conformed dimension is a dimension that is shared across multiple subject areas. Consider the ‘Customer’ dimension.
- Both the marketing and sales departments may use the same customer dimension table in their reports. Similarly, a ‘Time’ or ‘Date’ dimension will be shared by different subject areas.
- Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are said to be conformed.
- For example, if the subscriber dimension is connected to two fact tables – billing and claim – then the subscriber dimension would be treated as a conformed dimension.
- It might be utilized with different fact tables in a single database or over numerous data marts/warehouses.
A dimension that is utilized in multiple areas is called a conformed dimension. It might be utilized with different fact tables in a single database or over numerous data marts/warehouses. A conformed dimension is a dimension which is connected to two or more fact tables.
Real Life Example:
If the Order dimension is connected to the Product fact as well as the Order Item fact, and the user needs to fetch data from both tables, then the Order dimension will be joined to the Product as well as the Order Item table.
54.What is a junk dimension?
- Certain data may not be appropriate to store in the schema; this data (or these attributes) can be stored in a junk dimension.
- The nature of data of junk dimension is usually Boolean or flag values.
- A single dimension is formed by lumping a number of small dimensions.
- Junk dimension has unrelated attributes.
- A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.)
- so that those can be removed from other tables and can be junked into an abstract dimension table.
- These junk dimension attributes might not be related. The only purpose of this table is to store all the combinations of the dimensional attributes which you could not fit into the different dimension tables otherwise.
- Junk dimensions are often used to implement Rapidly Changing Dimensions in data warehouse.
- For example, it can be a member eligibility flag set as ‘Y’ or ‘N’, or any other indicator set as true/false, any specific comments, etc. If we keep all such indicator attributes in the fact table then its size gets increased. So, we combine all such attributes and put them in a single dimension table called a junk dimension, having unique junk IDs with a possible combination of all the indicator values.
- It is a dimension table comprising attributes that don’t have a place in the fact table or in any of the existing dimension tables. Generally, these are properties like flags or indicators.
55.What is degenerated dimension?
A degenerated dimension is a dimension that is derived from fact table and does not have its own dimension table.
A dimension key, such as
- transaction number,
- receipt number,
- Invoice number
- empno
etc. does not have any more associated attributes and hence cannot be designed as a dimension table.
- A degenerated dimension is a dimension which is not a fact but is present in the fact table as (part of) the primary key.
- It does not have its own dimension table.
- We can also call it as a single attribute dimension table.
But, instead of keeping it separately in a dimension table and putting an additional join, we put this attribute in the fact table directly as a key.
Since it does not have its own dimension table, it never acts as a foreign key in the fact table.
56.What is a role-playing dimension?
- Dimensions are often reused for multiple applications within the same database with different contextual meaning.
- For instance, a “Date” dimension can be used for “Date of Sale”, as well as “Date of Delivery”, or “Date of Hire”.
For example, a date dimension can be used for “Date of Claim”, “Billing date” or “Plan Term date”. Such a dimension is called a role-playing dimension. The primary key of the Date dimension will be associated with multiple foreign keys in the fact table.
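A sketch of how the role-playing joins might look in SQL, assuming a hypothetical claim_fact carrying two foreign keys into the same date_dim:

SELECT dc.calendar_date AS claim_date,
       db.calendar_date AS billing_date,
       f.claim_amount
FROM   claim_fact f
JOIN   date_dim dc ON dc.date_key = f.claim_date_key    -- role: date of claim
JOIN   date_dim db ON db.date_key = f.billing_date_key; -- role: billing date

The same physical date_dim table plays two roles here via table aliases.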
57.What are the different types of SCD’s used in data warehousing?

- Type 0 – Fixed Dimension (the passive method)
- No changes allowed; the dimension never changes
- Type 1 – No History (overwrite the old value)
- Update the record directly; there is no record of historical values, only the current state
- Type 2 – Row Versioning (create a new additional record)
- Track changes as version records with a current flag, active dates and other metadata
- Type 3 – Previous Value Column (add a new column)
- Track a change to a specific attribute; add a column to show the previous value, which is updated as further changes occur
- Type 4 – History Table (use a historical table)
- Show the current value in the dimension table but track all changes in a separate table
- Type 6 – Hybrid SCD (combine approaches of types 1, 2 and 3: 1+2+3=6)
- Utilise techniques from SCD Types 1, 2 and 3 to track changes
Type 0:
A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation. It just means that, even if the value of the attributes changes, history is not kept and the table continues to hold the originally loaded data.
For example, Subscriber’s DOB is a type-0 SCD because it will always remain the same irrespective of the time.
Type 1:
A type 1 dimension is where history is not maintained and the table always shows the recent data. This effectively means that such dimension table is always updated with recent data whenever there is a change, and because of this update, we lose the previous values.
For example, Subscriber’s address (where the business requires keeping only the current address of the subscriber) can be a Type-1 dimension.
Type 2:
A type 2 dimension table tracks the historical changes by creating separate rows in the table with different surrogate keys. Consider there is a customer C1 under group G1 first and later on the customer is changed to group G2. Then there will be two separate records in dimension table like below,
Key | Customer | Group | Start Date | End Date |
---|---|---|---|---|
1 | C1 | G1 | 1st Jan 2000 | 31st Dec 2005 |
2 | C1 | G2 | 1st Jan 2006 | NULL |
Note that separate surrogate keys are generated for the two records. NULL end date in the second row denotes that the record is the current record. Also note that, instead of start and end dates, one could also keep version number column (1, 2 … etc.) to denote different versions of the record.
For example, Subscriber’s address (where the business requires keeping a record of all the previous addresses of the subscriber). In this case, multiple rows for a subscriber will be inserted into the table with his/her different addresses.
Type 3:
A type 3 dimension stores the history in a separate column instead of separate rows. So unlike a type 2 dimension, which grows vertically, a type 3 dimension grows horizontally. See the example below,
Key | Customer | Previous Group | Current Group |
---|---|---|---|
1 | C1 | G1 | G2 |
This is only suitable when you do not need to store many consecutive changes and when the date of change is not required to be stored.
For example, Subscriber’s address (where the business requires keeping a record of the current and just one previous address). In this case, we can split the ‘address’ column into two different columns – ‘current address’ and ‘previous address’.
Type 4:
In this type of dimension, the historical data is preserved in a separate table. The main dimension table holds only the current data.
- For example, the main dimension table will have only one row per subscriber holding its current address. All other previous addresses of the subscriber will be kept in the separate history table. This type of dimension is hardly ever used.
Type 6:
A type 6 dimension is a hybrid of types 1, 2 and 3 (1+2+3) which acts very similarly to type 2, except that you add one extra column to denote which record is the current record.
Key | Customer | Group | Start Date | End Date | Current Flag |
---|---|---|---|---|---|
1 | C1 | G1 | 1st Jan 2000 | 31st Dec 2005 | N |
2 | C1 | G2 | 1st Jan 2006 | NULL | Y |
- Hybrid SCDs are a combination of both SCD 1 and SCD 2.
- It may happen that in a table, some columns are important and we need to track changes for them, i.e. capture their historical data, whereas for some other columns, even if the data changes, we do not have to bother.
- For such tables, we implement Hybrid SCDs, wherein some columns are Type 1 and some are Type 2.
59.What is a mini dimension?
Mini dimensions can be used to handle rapidly changing dimension scenarios. If a dimension has a huge number of rapidly changing attributes, it is better to separate those attributes into a different table called a mini dimension. This is done because if the main dimension table is designed as SCD Type 2, the table will soon grow too large and create performance issues. It is better to segregate the rapidly changing members into a different table, thereby keeping the main dimension table small and performant.
60.Bridge table
Add a bridge table to capture the many-to-many relationship between the OWNER and the PROPERTY tables. The bridge table contains the OWNER_ID and PROPERTY_ID query items. It also contains other query items that provide context or meaning to the relationship, such as share_percentage .
Restrictions
Bridge tables are subject to the following restrictions.
- Dynamic query mode assumes that a bridge table has been used to resolve a many-to-many relationship between dimensions. Design techniques which introduce a bridge table between a dimension and fact table are not supported.
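A minimal DDL sketch of such a bridge; the table and column names follow the OWNER/PROPERTY example above, while the data types are assumptions:

CREATE TABLE owner    (owner_id    INTEGER PRIMARY KEY);
CREATE TABLE property (property_id INTEGER PRIMARY KEY);

CREATE TABLE owner_property_bridge (
    owner_id         INTEGER NOT NULL REFERENCES owner(owner_id),
    property_id      INTEGER NOT NULL REFERENCES property(property_id),
    share_percentage NUMBER(5,2),           -- context for the relationship
    PRIMARY KEY (owner_id, property_id)     -- resolves the many-to-many
);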
61.What is a core dimension?
A core dimension is a dimension table that is dedicated to a single fact table or data mart.
62.How are the Dimension tables designed?
- Most dimension tables are designed up to 2NF.
- In some instances they are further normalized to 3NF.
- Find where data for this dimension are located.
- Figure out how to extract this data.
- Determine how to maintain changes to this dimension
63.How do you load the time dimension?
- Time dimensions are usually loaded by a program that loops through all possible dates appearing in the data.
- It is not unusual for 100 years to be represented in a time dimension, with one row per day.
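In Oracle-style SQL such a loading program can be a single row-generating query; the date_dim structure below is a hypothetical example:

-- Generate one row per day for roughly 100 years starting 01-Jan-2000
INSERT INTO date_dim (date_key, calendar_date, month_name, quarter, year)
SELECT TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')),   -- e.g. 20000101
       d,
       TO_CHAR(d, 'Month'),
       TO_NUMBER(TO_CHAR(d, 'Q')),
       TO_NUMBER(TO_CHAR(d, 'YYYY'))
FROM  (SELECT DATE '2000-01-01' + LEVEL - 1 AS d
       FROM   dual
       CONNECT BY LEVEL <= 36525);          -- about 100 years of days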
64.What are the Different methods of loading Dimension tables?
- Conventional Load: Before loading the data, all the table constraints will be checked against the data.
- Direct Load (faster loading): All the constraints will be disabled and the data will be loaded directly. Later the data will be checked against the table constraints and the bad data won’t be indexed.
65.Can a dimension table contain numeric values?
- Yes, but such columns are descriptive in nature (whether stored as numeric or char values); they are not measures to be aggregated.
- these are descriptive elements of the business
66.What is VLDB?
- The perception of what constitutes a VLDB continues to grow.
- A one terabyte database would normally be considered to be a VLDB.
68.What is rapidly changing dimension?
- This is a dimension where data changes rapidly.
69.Employee health details are hidden from his employer by the health care provider. Which level of data hiding is this? Conceptual, physical or external?
This is the scenario of an external level of data hiding
70.What is the form of fact table & dimension table?
Generally, the fact table is in normalized form and dimension table is in de-normalized form
71. If a unique constraint is applied to a column then will it throw an error if you try to insert two nulls into it?
It depends on the database. In Oracle, a null value is not considered equal to another null value, so more than one null can be inserted into a column with a unique constraint without any error. SQL Server, by contrast, treats nulls as equal for unique constraints and allows only one null in such a column.
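A quick sketch of this behaviour (the table name uniq_demo is just an illustration):

CREATE TABLE uniq_demo (val NUMBER UNIQUE);
INSERT INTO uniq_demo VALUES (NULL);
INSERT INTO uniq_demo VALUES (NULL);  -- succeeds in Oracle: NULL is not equal to NULL
-- In SQL Server the second insert would fail, because its UNIQUE
-- constraint treats NULLs as equal and allows at most one.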
72.Incremental loading for dimensions?
In a dimensional model, we may perform incremental loading for dimension tables also. One may argue that this won’t be necessary, as the data volume in dimension tables is not as high as the data volume in the fact tables, hence we can simply do a full load every time.
However, over time there can be considerable growth in the data in dimension tables, and things can get quite heavy, especially if we are trying to load SCD Type 2 dimensions.
Standard Method of Loading
Like before, for our purpose we will assume we have the below customer table in our source system from where we need to perform the data loading.
CustomerID | CustomerName | Type | LastUpdatedDate |
---|---|---|---|
1 | John | Individual | 22-Mar-2012 |
2 | Ryan | Individual | 22-Mar-2012 |
3 | Bakers' | Corporate | 23-Mar-2012 |
As discussed in the previous article, a typical SQL query to extract data incrementally from this source system will be like this:
SELECT t.*
FROM   Customer t
WHERE  t.lastUpdatedDate > (SELECT NVL(MAX(b.loaded_until),
                                       TO_DATE('01-01-1900', 'MM-DD-YYYY'))
                            FROM   batch b
                            WHERE  b.status = 'Success');
Here “batch” is a separate table which stores the date until which we have successfully extracted the data.
Batch_ID | Loaded_Until | Status |
---|---|---|
1 | 22-Mar-2012 | Success |
2 | 23-Mar-2012 | Success |
73.Which one to use: “Entry Date” / “Load Date” or “Last Update Date”?
In an incremental load methodology, we should extract a record when it is first created and, after that, whenever the record is updated. Therefore, we should always look for a “last update date” column for extracting records. This is because “entry date” or “load date” columns in the source systems are not enough to determine whether the record was updated at a later point in time.
Often source systems maintain 2 different columns as load_date and last_update_date. When extracting data based on “last update date”, ensure that source systems always populate “last updated date” field with “load date” when the record is first created.
74.What are the benefits of incremental loading of dimension tables?
- Once we extract records incrementally based on their last update date, we can compare each record with the target based on their natural keys and determine if the record is a new record or an updated record.
- However, if we do not extract incrementally (and every time extract all the records from source), then the number of records to compare against target will be much higher resulting into performance degradation.
- If we are doing incremental loading, records that do not have any change will not come – only new or updated records will come. But if we are doing a full load, everything will come irrespective of any change.
75.History Preserving in Dimensional Modeling
One of the important objectives while doing data modeling is to develop a model which can capture the states of the system with respect to time. You know, nothing lasts forever! Product prices change over time; people change their addresses, marital status, employers and even their names. If you are doing data modeling for a data warehouse, where we are particularly interested in historical analysis, it is crucial that we develop some method of capturing these changes in our data model.
76.Types of Changing Dimensions
Unchanging Dimension
There are some dimensions that do not change at all. For example, let’s say you have created a dimension table called “Gender“. Below are the structure and data of this dimension table: Gender
ID | VALUE |
---|---|
1 | Male |
2 | Female |
Slowly Changing Dimension
SCD Type 1
As mentioned above, we design a dimension as SCD type 1 when we do not want to store the history. That is, whenever some values are modified in the attributes, we just want to update the old values with the new values and we do not care about storing the previous history.
We do not store any history in SCD Type 1
Please mind, this is not the same as the “Unchanging Dimension” discussed above. In the case of an unchanging dimension, we assume that the values of the attributes of that dimension will not change at all. On the other hand, in the case of a SCD Type 1 dimension, we assume that the values of the attributes will change slowly; however, we are not interested in storing those changes. We are only interested in storing the current or latest value. So every time it changes, we will update the old value with the new one.
Handling SCD Type 1 Dimension in ETL Process
SCD Type 1 dimensions are loaded using “Merge” operation which is also known as “UPSERT” as an abbreviation of “Update else Insert”.
SCD Type 1 dimensions are loaded by Merge operations
In pure ANSI SQL syntax, there is a particular statement that helps you achieve the UPSERT operation. It’s called the “MERGE” statement.
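A minimal sketch of such a MERGE for a Type 1 customer dimension; the customer_dim, customer_stage and customer_dim_seq names are assumptions:

MERGE INTO customer_dim d
USING customer_stage s
ON (d.customer_code = s.customer_code)            -- match on the natural key
WHEN MATCHED THEN UPDATE
     SET d.customer_address = s.customer_address  -- overwrite, losing history
WHEN NOT MATCHED THEN INSERT
     (customer_key, customer_code, customer_address)
     VALUES (customer_dim_seq.NEXTVAL, s.customer_code, s.customer_address);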
SCD Type 2
We will take our “Food” dimension table as an example here, where “Price” is a variable factor. Food
KEY | NAME | TYPE_KEY | PRICE |
---|---|---|---|
1 | Chicken Burger | 1 | 3.70 |
2 | Veggie Burger | 1 | 3.20 |
3 | French Fries | 2 | 2.00 |
4 | Twister Fries | 2 | 2.20 |
Design of SCD Type 2 Dimension
In order to design the above table as SCD Type 2, we will have to add 3 more columns in this table, “Date From”, “Date To” and “Latest Flag“. These columns are called type 2 metadata columns. See below: Food
KEY | NAME | TYPE_KEY | PRICE | DATE_FROM | DATE_TO | LATEST_FLAG |
---|---|---|---|---|---|---|
1 | Chicken Burger | 1 | 3.70 | 01-Jan-11 | 31-Dec-99 | Y |
2 | Veggie Burger | 1 | 3.20 | 01-Jan-11 | 31-Dec-99 | Y |
3 | French Fries | 2 | 2.00 | 01-Jan-11 | 31-Dec-99 | Y |
4 | Twister Fries | 2 | 2.20 | 01-Jan-11 | 31-Dec-99 | Y |
Now suppose that on 15-Mar-11 the price of Veggie Burger changes from 3.20 to 3.25. We close the old record by setting its latest flag to “N” and its “Date To” to 14-Mar-11, and insert a new record with a new key:
KEY | NAME | TYPE_KEY | PRICE | DATE_FROM | DATE_TO | LATEST_FLAG |
---|---|---|---|---|---|---|
1 | Chicken Burger | 1 | 3.70 | 01-Jan-11 | 31-Dec-99 | Y |
2 | Veggie Burger | 1 | 3.20 | 01-Jan-11 | 14-Mar-11 | N |
3 | French Fries | 2 | 2.00 | 01-Jan-11 | 31-Dec-99 | Y |
4 | Twister Fries | 2 | 2.20 | 01-Jan-11 | 31-Dec-99 | Y |
5 | Veggie Burger | 1 | 3.25 | 15-Mar-11 | 31-Dec-99 | Y |
Later, on 20-Dec-11, the price of Chicken Burger changes to 3.80 and the price of Veggie Burger reverts to 3.20; again the affected current records are closed and new versions inserted:
KEY | NAME | TYPE_KEY | PRICE | DATE_FROM | DATE_TO | LATEST_FLAG |
---|---|---|---|---|---|---|
1 | Chicken Burger | 1 | 3.70 | 01-Jan-11 | 19-Dec-11 | N |
2 | Veggie Burger | 1 | 3.20 | 01-Jan-11 | 14-Mar-11 | N |
3 | French Fries | 2 | 2.00 | 01-Jan-11 | 31-Dec-99 | Y |
4 | Twister Fries | 2 | 2.20 | 01-Jan-11 | 31-Dec-99 | Y |
5 | Veggie Burger | 1 | 3.25 | 15-Mar-11 | 19-Dec-11 | N |
6 | Chicken Burger | 1 | 3.80 | 20-Dec-11 | 31-Dec-99 | Y |
7 | Veggie Burger | 1 | 3.20 | 20-Dec-11 | 31-Dec-99 | Y |
As you can see from the design above, it is now possible to go back to any date in the history and figure out what was the value of the “Price” attribute of “Food” dimension at that point in time.
Surrogate key for SCD Type 2 dimension
Note from the above example that, each time we generate a new row in the dimension table, we also assign a new key to the record.
This is the key that flows down to the fact table in a typical Star schema design. The values of this key (the numbers 1, 2, 3, …, 7 etc.) do not come from the source systems.
Instead, those numbers are sequential running numbers which are generated automatically at the time of inserting these records.
These numbers are unique, so as to uniquely identify each record in the table, and are called “Surrogate Key” of the table.
Obviously, multiple surrogate keys may relate to the same item; however, each key will relate to one particular state of that item in time. In the above example, keys 2, 5 and 7 are all linked to “Veggie Burger”, but they represent the state of the record in 3 different time spans. It’s worth noting that there should be only one record with latest flag = “Y” among multiple records of the same item.
Alternate Design of SCD Type 2: Addition of Version Number
In this design pattern, the record with the highest version is always the latest record. If we utilize this design in our earlier example, the dimension table will look like this: Food
KEY | NAME | TYPE_KEY | PRICE | DATE_FROM | DATE_TO | VERSION |
---|---|---|---|---|---|---|
1 | Chicken Burger | 1 | 3.70 | 01-Jan-11 | 19-Dec-11 | 1 |
2 | Veggie Burger | 1 | 3.20 | 01-Jan-11 | 14-Mar-11 | 1 |
3 | French Fries | 2 | 2.00 | 01-Jan-11 | 31-Dec-99 | 1 |
4 | Twister Fries | 2 | 2.20 | 01-Jan-11 | 31-Dec-99 | 1 |
5 | Veggie Burger | 1 | 3.25 | 15-Mar-11 | 19-Dec-11 | 2 |
6 | Chicken Burger | 1 | 3.80 | 20-Dec-11 | 31-Dec-99 | 2 |
7 | Veggie Burger | 1 | 3.20 | 20-Dec-11 | 31-Dec-99 | 3 |
Of course, we can also keep the “Latest Flag” column in the above table if we wish.
Handling SCD Type 2 Dimension in ETL Process
Unlike SCD Type 1, Type 2 requires you to insert new records in the table as and when any attribute changes.
This is obviously different from SCD Type 1.
In the case of SCD Type 1, we were only updating the record. But here, we will need to update the old record (e.g. changing the latest flag from “Y” to “N” and updating the “Date To”) as well as insert a new record.
Like before, we can use the “natural key” to first check whether the source record exists in the target or not. If not, we will simply insert the record into the target with a new surrogate key.
But if it already exists in the target, we will have to check whether any attribute value has changed between source and target – if not, we can ignore the source record.
But if yes, we will have to update the existing record (set its latest flag to “N” and close its “Date To”) and insert a new record with a new surrogate key.
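A two-step sketch of this logic for the Food example, assuming a hypothetical staging table food_stage holding today’s extract, a sequence food_dim_seq, and 31-Dec-9999 as the open-ended “Date To”:

-- Step 1: insert a new version for brand-new or changed records
INSERT INTO food_dim
       (food_key, name, type_key, price, date_from, date_to, latest_flag)
SELECT food_dim_seq.NEXTVAL, s.name, s.type_key, s.price,
       TRUNC(SYSDATE), DATE '9999-12-31', 'Y'
FROM   food_stage s
LEFT JOIN food_dim d
       ON  d.name = s.name                 -- natural key match
       AND d.latest_flag = 'Y'
WHERE  d.name IS NULL                      -- brand-new record, or
   OR  s.price <> d.price;                 -- an attribute changed

-- Step 2: close off the superseded versions
UPDATE food_dim d
SET    d.date_to     = TRUNC(SYSDATE) - 1,
       d.latest_flag = 'N'
WHERE  d.latest_flag = 'Y'
AND    d.date_from < TRUNC(SYSDATE)        -- skip the rows just inserted
AND    EXISTS (SELECT 1 FROM food_stage s
               WHERE  s.name  = d.name
               AND    s.price <> d.price);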
77.Performance Considerations of SCD Type 2 Dimension
(See also the answer above on handling SCD Type 2 in the ETL process.)
SCD Type 2, by design, tends to increase the volume of the dimension tables considerably.
Think of this: let’s say you have an “employee” dimension table which you have designed as SCD Type 2. The employee dimension has 20 different attributes, and there are 10 attributes in this table which change at least once a year on average (e.g. employee grade, manager’s name, department, salary, band, designation etc.). This means if you have 1,000 employees in your company, at the end of just one year you are going to get 10,000 records in this dimension table (assuming on average 10 attribute changes per employee per year, each resulting in a new row).
As you can see, this is not a very good thing performance-wise, as it can considerably slow down the loading of your fact table, since you will need to “look up” this dimension table during fact loading.
One may argue that, even if we have 10,000 records, we will actually have only 1,000 records with Latest_Flag = ‘Y’ and since we will only lookup records with Latest_Flag = ‘Y’, the performance will not deteriorate.
This is not entirely true. While utilizing the Latest_Flag = ‘Y’ filter may decrease the size of the lookup cache, the database will generally need to do a full table scan (FTS) to identify the latest records.
Moreover, in many cases the ETL developer will not be able to make use of the Latest_Flag = ‘Y’ column if the transactional records do not always belong to the latest time (e.g. late-arriving fact records, or loading the fact table at a later point in time – month-end load / week-end load etc.).
In those cases, putting a latest_flag = ‘Y’ filter will be functionally incorrect, as you should determine the correct return key on the basis of the “Date To” and “Date From” columns.
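In that situation the lookup is done against the date range instead; a sketch, with :food_name and :txn_date as placeholders for values coming from the incoming fact record:

SELECT d.food_key
FROM   food_dim d
WHERE  d.name = :food_name                           -- natural key
AND    :txn_date BETWEEN d.date_from AND d.date_to;  -- version valid at that time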
SCD Type 3
In the Type 2 design above, we have seen that whenever the values of the attributes change, we insert new rows into the table. In the case of type 3, however, we add a new column to the table to store the history.
So let’s say we have a table with 2 columns initially – “Key” and “Attribute”.
KEY | ATTRIBUTE |
---|---|
1 | A |
2 | B |
3 | C |
If the record 1 changes its attribute from A to D, we will add one extra column to the table to store this change.
KEY | ATTRIBUTE | ATTRIBUTE_OLD |
---|---|---|
1 | D | A |
2 | B | |
3 | C | |
If the record changes its attribute value again, we will again have to add a column to store the history of the changes
KEY | ATTRIBUTE | ATTRIBUTE_OLD | ATTRIBUTE_OLD_1 |
---|---|---|---|
1 | E | D | A |
2 | B | | |
3 | C | | |
Isn’t SCD Type 3 very cumbersome, then? As you can see, storing the history by changing the structure of the table in this way is quite cumbersome, and after the attributes have changed a few times the table becomes unnecessarily wide and difficult to manage. But that does not mean the SCD Type 3 design methodology is completely unusable. In fact, it is quite usable in one particular circumstance: where we just need to store partial history information.
Let’s think about a special circumstance where we only need to know the “current value” and “previous value” of an attribute. That is, even though the value of that attribute may change numerous times, at any time we are only concerned about its current and previous values.
In such circumstances, we can design the table as type 3 and keep only 2 columns – “current value” and “previous value” like below.
KEY | CURRENT_VALUE | PREVIOUS_VALUE |
---|---|---|
1 | D | A |
2 | B | |
3 | C | |
For example, consider a report in the telecom business wherein a certain calculated field depends on the latest and previous values of the customer status.
That calculated attribute was called “Churn Indicator” (churn in the telecom business generally means leaving a telephone connection), and the rule compared the current status with the previous one.
To derive the correct value of the churn indicator, you do not need to know the complete history of changes of the customer’s status. All you need to know is the current and previous status. In this kind of partial history scenario, the SCD Type 3 design is very useful.
Note here that, compared to SCD Type 2, type 3 does not increase the number of records in the table, thereby easing performance concerns.
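A sketch of the corresponding Type 3 maintenance, assuming a hypothetical customer_dim with current_status and previous_status columns as in the churn example:

UPDATE customer_dim
SET    previous_status = current_status,   -- shift the old value across
       current_status  = :new_status
WHERE  customer_code   = :customer_code
AND    current_status <> :new_status;      -- only when the status really changed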
78.Rapidly Changing Dimensions
If you design a dimension table that has a rapidly changing attribute, then your dimension table will become rapidly changing dimension.
Every month, the status of any subscriber keeps on changing multiple times based on his or her account balance thereby making the “Subscribers” dimension one rapidly changing dimension.
Implementing Rapidly changing dimension
Handling rapidly changing dimensions is tricky due to various performance implications. This article attempts to provide some methodologies for handling rapidly changing dimensions in a data warehouse.
The problem with a type 2 slowly changing dimension is that every change in the dimensional attributes increases the number of rows in the table. If a lot of changes happen to the attributes of the dimension table (that is to say, the dimension is rapidly changing), the table quickly becomes bulky, causing considerable performance issues. Hence the typical SCD Type 2 solution may not be a very good fit for rapidly changing scenarios.
Junk Dimension
The method that we are going to consider here assumes that not all the attributes of a dimension table are rapidly changing in nature.
There might be a few attributes which are changing quite often and some other attributes which seldom change. If we can separate the fast changing attributes from the slowly changing ones and move them in some other table while maintaining the slowly changing attributes in the same table, we can get rid of the issue of bulking up the dimension table.
So let’s take one example to see how it works. Let’s say CUSTOMER dimension has following columns:
- CUSTOMER_KEY
- CUSTOMER_NAME
- CUSTOMER_GENDER
- CUSTOMER_MARITAL_STATUS
- CUSTOMER_TIER
- CUSTOMER_STATUS
While attributes like name, gender, marital status etc. do not change at all or rarely change,
let’s assume customer tier and status change every month based on the customer’s buying pattern. If we decide to keep status and tier in the same SCD Type 2 customer dimension table, we risk filling up the table too much too soon. Instead, we can pull out those two attributes into yet another table, which some people refer to as a JUNK DIMENSION. Here is how our junk dimension will look. In this case, it will have 3 columns as shown below.
Junk Dimension Structure
- SEGMENTATION_KEY
- TIER
- STATUS
The column SEGMENTATION_KEY is a surrogate key. This acts as the primary key of the table. Also since we have removed status and tier from our main dimension table, the dimension table now looks like this:
- CUSTOMER_KEY
- CUSTOMER_NAME
- CUSTOMER_GENDER
- CUSTOMER_MARITAL_STATUS
Next, we must create a linkage between the above customer dimension and our newly created JUNK dimension. Note here that we cannot simply pull the primary key of the JUNK dimension (which we are calling SEGMENTATION_KEY) into the customer dimension as a foreign key. If we did so, then any change in the JUNK dimension would require us to create a new record in the customer dimension to refer to the changed key. This would, in effect, again increase the data volume of the dimension table. We solve this problem by creating one more mini table between the original customer dimension and the junk dimension. This mini dimension table acts as a bridge between them. We also put “start date” and “end date” columns in this mini table so that we can track the history. Here is how our new mini table looks:
Mini Dimension Structure
- CUSTOMER_KEY
- SEGMENTATION_KEY
- START_DATE
- END_DATE
This table does not require any surrogate key. However, one may include one “CURRENT FLAG” column in the table if required. The whole model now links the customer dimension through this mini dimension (bridge) to the junk dimension.
Maintaining the Junk Dimension
If the number of attributes and the number of possible distinct values per attribute (cardinality) are not very large in the junk dimension, we can actually pre-populate the junk dimension once and for all. In our earlier example, let’s say the possible values of status are only “Active” and “Inactive” and the possible values of tier are only “Platinum”, “Gold” and “Silver”. That means there can be only 3 X 2 = 6 distinct combinations of records in this table. We can pre-populate the table with these 6 records, from segmentation key = 1 to 6, and assign one key to each customer based on the customer’s status and tier values.
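A sketch of that one-time pre-population via a cross join (the table name customer_segmentation_junk is an assumption):

INSERT INTO customer_segmentation_junk (segmentation_key, tier, status)
SELECT ROWNUM, t.tier, s.status
FROM  (SELECT 'Platinum' AS tier FROM dual UNION ALL
       SELECT 'Gold'            FROM dual UNION ALL
       SELECT 'Silver'          FROM dual) t
CROSS JOIN
      (SELECT 'Active'   AS status FROM dual UNION ALL
       SELECT 'Inactive'          FROM dual) s;
-- Produces the 6 (tier, status) combinations with keys 1 to 6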
How does this Junk dimension help?
Since the connection between the segmentation key and customer key is actually maintained in the mini dimension table, frequent changes in tier and status do not change the number of records in the dimension table. Whenever a customer’s status or tier attribute changes, a new row is added in the mini dimension (with START_DATE = date of change of status) signifying the current relation between the customer and the segmentation.
It’s also worth mentioning that in this schema, we can manage the original customer dimension table in SCD type 1 or Type 2 methods, but we will have to take extra care to update the mini dimension also as and when there is a change in the key in the original dimension table.
Hierarchy
- Hierarchies are logical structures that use ordered levels as a means of organizing data.
- A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level.
- A hierarchy can also be used to define a navigational drill path and to establish a family structure. Within a hierarchy, each level is logically connected to the levels above and below it.
- Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies – one for product categories and one for product suppliers.
- Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity.
- This is one of the key benefits of a data warehouse. When designing hierarchies, you must consider the relationships in business structures: for example, a divisional multilevel sales organization. Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.
- A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels.
- Levels range from general to specific, with the root level as the highest or most general level.
- The levels in a dimension are organized into one or more hierarchies.
- Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information.
- They define the parent-child relationship between the levels in a hierarchy.
- Hierarchies are also essential components in enabling more complex rewrites.
- For example, the database can aggregate existing sales revenue at the quarterly level to a yearly aggregation when the dimensional dependencies between quarter and year are known.
A hierarchy is a series of many-to-one relationships organized into different levels.
80.Storing hierarchical information in dimension tables
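Consistent with the de-normalized dimensions discussed above, the usual approach is to flatten every level of the hierarchy into its own column of the dimension table, so that rolling up is simply grouping by a higher-level column. A minimal sketch, using a hypothetical product dimension:

CREATE TABLE product_dim (
    product_key         INTEGER PRIMARY KEY,
    product_name        VARCHAR2(100),  -- lowest (most granular) level
    product_subcategory VARCHAR2(50),   -- intermediate level
    product_category    VARCHAR2(50)    -- highest (most general) level
);
-- GROUP BY product_category reports at the top level;
-- adding product_subcategory to the GROUP BY drills down one level.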
RELATIONSHIPS
81.Why are recursive relationships bad? How do you resolve them?
A recursive relationship occurs when there is a relationship between an entity and itself. For example, a one-to-many recursive relationship occurs when an employee is the manager of other employees.
The employee entity is related to itself, and there is a one-to-many relationship between one employee (the manager) and many other employees (the people who report to the manager).
- A recursive relationship occurs in the case where an entity is related to itself.
- Talking about health care domain, it is a possibility that a health care provider (say, a doctor) is a patient to any other health care provider.
- Because, if the doctor himself falls ill and needs a surgery, he will have to visit some other doctor for getting the surgical treatment.
- So, in this case, the entity – health care provider is related to itself.
- A foreign key holding the health care provider’s number will have to be present in each member’s (patient’s) record.
A relationship between two entities of similar entity type is called a recursive relationship. Here the same entity type participates more than once in a relationship type with a different role for each instance.
In other words, a relationship has always been between occurrences in two different entities. However, it is possible for the same entity to participate in the relationship. This is termed a recursive relationship.
Example –
Let us suppose that we have an employee table. A manager supervises a subordinate. Every employee can have a supervisor except the CEO and there can be at most one boss for each employee. One employee may be the boss of more than one employee. Let’s suppose that REPORTS_TO is a recursive relationship on the Employee entity type where each Employee plays two roles
- Supervisor
- Subordinate
Supervisor and Subordinate are called “Role Names”. Here the degree of the REPORTS_TO relationship is 1 i.e. a unary relationship.
- The minimum cardinality of Supervisor entity is ZERO since the lowest level employee may not be a manager for anyone.
- The maximum cardinality of Supervisor entity is N since an employee can manage many employees.
- Similarly, the Subordinate entity has a minimum cardinality of ZERO to account for the case of the CEO, who is not a subordinate to anyone.
- Its maximum cardinality is ONE, since a subordinate employee can have at most one supervisor.
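The REPORTS_TO relationship above is typically implemented (resolved) with a self-referencing foreign key; a minimal sketch:

CREATE TABLE employee (
    emp_no        INTEGER PRIMARY KEY,
    emp_name      VARCHAR2(100) NOT NULL,
    supervisor_no INTEGER REFERENCES employee(emp_no)  -- NULL for the CEO
);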
82.What is self-recursive relationship?
When a standalone column in a table is connected to the primary key of the same table, it is called a self-recursive relationship.
83.What is an identifying relationship?
- Usually, in a data model, parent tables and child tables are present. Parent table and child table are connected by a relationship line.
- If the referenced column in the child table is a part of the primary key in the child table, relationship is drawn by thick lines by connecting these two tables, which is called as identifying relationship.
84.What is a non-identifying relationship?
- Usually, in a data model, parent tables and child tables are present. Parent table and child table are connected by a relationship line.
- If the referenced column in the child table is not a part of the primary key but a standalone column in the child table, the relationship is drawn by dotted lines connecting these two tables, and is called a non-identifying relationship.
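A DDL sketch contrasting the two (the customer/orders/order_item tables are hypothetical):

CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY
);

-- Non-identifying: customer_id is a standalone, non-key column (dotted line)
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id)
);

-- Identifying: the parent's key is part of the child's primary key (thick line)
CREATE TABLE order_item (
    order_id INTEGER REFERENCES orders(order_id),
    line_no  INTEGER,
    PRIMARY KEY (order_id, line_no)
);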
METADATA
85.What is meant by metadata in the context of a data warehouse and why is it important?
Metadata captures, in a data dictionary (a.k.a. metadata):
- the source (where and how the data originated),
- the nature of the data (char, varchar, nullable, existence, valid values etc.),
- and the behavior of the data (how it is modified/derived and its life cycle).
In short, metadata is the summarized data that leads us to the detailed data.
86.What is the difference between metadata and data dictionary?
- Metadata is defined as data about the data. A data dictionary, by contrast, contains information about the project, graphs, Ab Initio commands and server information.
- Metadata repository is an integral part of a data warehouse system. It contains the following metadata −
- Business metadata − It contains the data ownership information, business definition, and changing policies.
- Operational metadata − It includes currency of data and data lineage. Currency of data refers to the data being active, archived, or purged. Lineage of data means history of data migrated and transformation applied on it.
- Data for mapping from the operational environment to the data warehouse − This metadata includes source databases and their contents, data extraction, data partition, cleaning, transformation rules, data refresh and purging rules.
- The algorithms for summarization − It includes dimension algorithms, data on granularity, aggregation, summarizing, etc.
Categories of Metadata
Metadata can be broadly categorized into three categories −
- Business Metadata − It has the data ownership information, business definition, and changing policies.
- Technical Metadata − It includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.
- Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of data migrated and transformation applied on it.
KEY
87.What is a pseudo key? What is its use?
If a table in a database either has no natural key, or the natural key is so long that using it as the key would be inefficient, then we need to consider using a pseudo key instead. Pseudo keys are usually given the same name as the table they belong to, with _id added to the end of the name. They are usually defined to contain a number and use AUTO_INCREMENT in MySQL, or the equivalent in other versions of SQL, to dynamically allocate a value when new rows are created.
- surrogate key is a substitution for the natural primary key.
- It is just a unique identifier or number for each row that can be used for the primary key to the table.
- The only requirement for a surrogate primary key is that it is unique for each row in the table.
- Data warehouses typically use a surrogate key for the dimension tables primary keys. They can use sequence generator, or Oracle sequence, or SQL Server Identity values for the surrogate key.
- It is useful because the natural primary key (i.e. Customer Number in the Customer table) can change, and this makes updates more difficult. Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, and you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and, as far as the client is concerned, you may display only the AIRPORT_NAME.
- another benefit you can get from surrogate keys (SID) is:
- Tracking the SCD – Slowly Changing Dimension.
- You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the Employee was Employee Code ‘E1’, but for you it becomes Employee Code + Business Unit (‘E1’ + ‘BU1’ or ‘E1’ + ‘BU2’). The difference from the natural key enlargement process is that the surrogate key remains a single, stable column no matter how many attributes drive the versioning.
What is an artificial (derived) primary key? When should it be used?
Using a name as the primary key violates the principle of stability. The social security number might be a valid choice, but a foreign employee might not have a social security number. This is a case where a derived, rather than a natural, primary key is appropriate. A derived key is an artificial key that you create. A natural key is one that is already part of the database
88.What is a primary key constraint?
Primary key constraint is imposed on the column data to avoid null values and duplicate values. Primary Key=Unique + Not Null. Example: social security number, bank account number, bank routing number
89.What is a composite primary key constraint?
When more than one column is a part of the primary key, it is called as composite primary key constraint.
90.What is a foreign key constraint?
Parent table has primary key and a foreign key constraint is imposed on a column in the child table. The foreign key column value in the child table will always refer to primary key values in the parent table.
91.What is a composite foreign key constraint?
When group of columns are in a foreign key, it is called as composite foreign key constraint
92.What is a constraint? What are the different types of constraint?
Constraint is a rule imposed on the data. The different types of constraints are
- primary key,
- unique,
- not null,
- foreign key,
- composite foreign key,
- check constraint .
93.What is a unique constraint?
A unique constraint is imposed on the column data to avoid duplicate values, but the column can contain NULL values.
94.What is a check constraint?
A check constraint is used to restrict the range of values allowed in a column.
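A single DDL sketch covering the constraint types from questions 88-94 (the branch/account tables are hypothetical):

CREATE TABLE branch (
    branch_id NUMBER PRIMARY KEY
);

CREATE TABLE account (
    account_no  NUMBER        PRIMARY KEY,                   -- primary key
    ssn         VARCHAR2(11)  UNIQUE,                        -- unique, nullable
    holder_name VARCHAR2(100) NOT NULL,                      -- not null
    branch_id   NUMBER        REFERENCES branch(branch_id),  -- foreign key
    balance     NUMBER(12,2)  CHECK (balance >= 0)           -- check constraint
);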
95.Candidate Key:
The minimal set of attributes which can uniquely identify a tuple is known as a candidate key. For example, STUD_NO in the STUDENT relation.
- The value of Candidate Key is unique and non-null for every tuple.
- There can be more than one candidate key in a relation. For Example, STUD_NO as well as STUD_PHONE both are candidate keys for relation STUDENT.
- The candidate key can be simple (having only one attribute) or composite as well. For Example, {STUD_NO, COURSE_NO} is a composite candidate key for relation STUDENT_COURSE.
Note – In SQL Server, a unique constraint on a nullable column allows the value ‘null’ in that column only once. That is why STUD_PHONE can still serve as a candidate key (it can hold at most one null), whereas null values are never allowed in a primary key attribute.
96.Super Key:
The set of attributes which can uniquely identify a tuple is known as Super Key. For Example, STUD_NO, (STUD_NO, STUD_NAME) etc.
- Adding zero or more attributes to a candidate key generates a super key.
- A candidate key is a super key, but vice versa is not true.
97.Primary Key:
There can be more than one candidate key in a relation out of which one can be chosen as primary key. For Example, STUD_NO as well as STUD_PHONE both are candidate keys for relation STUDENT but STUD_NO can be chosen as primary key (only one out of many candidate keys).
98.Alternate Key:
A candidate key other than the primary key is called an alternate key. For example, STUD_NO as well as STUD_PHONE are both candidate keys for the relation STUDENT; if STUD_NO is chosen as the primary key, STUD_PHONE becomes the alternate key (the candidate keys not chosen as primary key).
99.Foreign Key:
If an attribute can only take the values which are present as values of some other attribute, it will be foreign key to the attribute to which it refers. The relation which is being referenced is called referenced relation and corresponding attribute is called referenced attribute and the relation which refers to referenced relation is called referencing relation and corresponding attribute is called referencing attribute. Referenced attribute of referenced relation should be primary key for it. For Example, STUD_NO in STUDENT_COURSE is a foreign key to STUD_NO in STUDENT relation.
It may be worth noting that, unlike the primary key of a relation, a foreign key can be NULL and may contain duplicate values, i.e. it need not follow the uniqueness constraint.
For example, STUD_NO in the STUDENT_COURSE relation is not unique; it is repeated in the first and third tuples. However, STUD_NO in the STUDENT relation is a primary key, so it must always be unique and cannot be null.
An aggregate function (or aggregation function) is a function where the values of multiple rows are grouped together to form a single summary value.