TOP DATA MODELLING QUESTIONS

1.Categories of data model:

High Level / Conceptual data model.
  •  the way many users perceive data.
  • It uses concepts such as Entities, Attributes and Relationships.

Logical Data Model:

  • Entities, Attributes, Super Types, Sub Types, Primary Key, foreign keys,Alternate Key, Inversion Key Entry, Rule, Relationship, Definition, business rule, etc
  • representation of your database in  logical way.
  • actual implementation and extension of a conceptual data model.
Low Level/ Physical data model.
  • how data is stored in the computer by representing information such as record formats, record orderings and access path.

Physical Data Model :

  • primary keys, foreign keys, table names, column names and column data types, tables, columns, relationship, database properties for the physical implementation of databases.
  • representation of physical database.
  • Database performance, indexing strategy, and physical storage are
  • The approach by which physical data models are created is called as physical data modeling.
  • The physical data model  These entities were subscriber, member, healthcare provider, claim, bill, enrollment, group, eligibility, plan/product, commission, capitation, etc.

Representational/ Implementation data model.

  • They include Hierarchical data model, Network data model and Relational data model.

2.What does data model contain?

Logical Data Model: Entity, Attributes, Super Type, Sub Type, Primary Key, Alternate Key, Inversion Key Entry, Rule, Relationship, Definition, business rule, etc

Physical Data Model: Table, Column, Primary key Constraint, Unique Constraint or Unique Index, Non Unique Index, Check Constraint, Default Value, Foreign Key, comment etc.

Captur11e

conceptual data model showing how the project looked like on a high-level

3.

Capture222

4.What is Dimensional Modelling?

  •  concept –> to build their datawarehouse.
  • In this design model all the data is stored in two types of tables
  • Fact table
    •  facts/measurementsof the business and
    • different transactional measurements
    • foreign keys from dimension tables that qualifies the data.
    •  contains numerical value which can be aggregated and analyzed on fact values.
    • Fact tables are used to store various transactional measurements from “dimension tables” that qualifies the data.
    • Facts are typically (but not always) numeric values that can be aggregated, 

 

  • Dimension table
    • contains the context of measurements
    •  the dimensions on which the facts are calculated.
    • Dimension defines hierarchies and description of fact values
    •  dimensions are groups of hierarchies and descriptors that define the facts.
  • The goal of Dimensional model is not to achieve high degree of normalization but to facilitate easy and faster data retrieval.
    • Faster Data Retrieval
    • Better Understandability
    • Extensibility
  • Dimensional modeling is often used in Data warehousing.
    • A snowflake and
    • star schema represent data modeling
    • Conformed

5.

6.Types of Fact tables

Mainly 4 types

  • Transactional fact
  • Periodic snapshots
  • Accumulating snapshots
  • Temporal snapshots

others :

  • fact-less-fact
  • coverage fact
  • incident and snapshot facts
  • conformed fact
  • aggregate fact table
  • cummulative fact 
  • centipede fact 

There are four fundamental measurement events, which characterize all fact tables

Transactional

  • A transactional table is the most basic and fundamental.
  • The grain associated with a transactional fact table is usually specified as “one row per line in a transaction”, e.g., every line on a receipt.
  • Typically a transactional fact table holds data of the most detailed level, causing it to have a great number of dimensions associated with it.
  • These fact tables represent an event that occurred at an instantaneous point in time.
  • A row exists in the fact table for a given customer or product only if a transaction has occurred.
  • A given customer or product is likely linked to multiple rows in the fact table because the customer or product is involved in more than one transaction.
  • Transaction data often is structured quite easily into a dimensional framework.
  • The lowest-level data is the most natural dimensional data, supporting analyses that cannot be done on summarized data.
  • Unfortunately, even with transaction-level data, there is still a whole class of urgent business questions that are impractical to answer using only transaction detail.

Periodic snapshots

  • The periodic snapshot, as the name implies, takes a “picture of the moment“, where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month.
  • A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.
  • This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. 
  • Example: Daily balances fact can be summed up through the customers’ dimension but not through the time dimension.
  • Periodic snapshots are needed to see the cumulative performance of the business at regular, predictable time intervals.
  • Unlike the transaction fact table, where we load a row for each event occurrence, with the periodic snapshot, we take a picture of the activity at the end of a day, week, or month, then another picture at the end of the next period, and so on.
  • Example: A performance summary of a salesman over the previous month.

Accumulating snapshots

  • This type of fact table is used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order.
  • An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated.
  • An accumulating snapshot table often has multiple date columns, each representing a milestone in the process.
  • Therefore, it’s important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.
  • This type of fact table is used to show the activity of a process that has a well-defined beginning and end. For example, the processing of an order.
  • An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated.
  • Accumulating snapshots almost always have multiple date stamps, representing the predictable major events or phases that take place during the course of a lifetime.
  • Often there’s an additional date column that indicates when the snapshot row was last updated. Since many of these dates are not known when the fact row was first loaded, we must use surrogate date keys to handle undefined dates.
  • In sharp contrast to the other fact table types, we revisit accumulating snapshot fact table rows to update them.
  • Unlike the periodic snapshot, where we hang onto the prior snapshot, the accumulating snapshot merely reflects the accumulated status and metrics. Sometimes accumulating and periodic snapshots work in conjunction with one another.

 

Temporal snapshots

  • By applying temporal database theory and modeling techniques the temporal snapshot fact table  allows to have the equivalent of daily snapshots without really having daily snapshots.
  • It introduces the concept of time Intervals into a fact table, allowing to save a lot of space, optimizing performances while allowing the end user to have the logical equivalent of the “picture of the moment” they are interested in.

Centipede Fact Table

Centipede fact table is a normalized fact table. Modeller may decide to normalize the fact instead of snow flaking dimensions tables.

Conformed Fact Tables

They are measures re-used across multiple dimension models.

For example, KPI such as profit, revenue etc

Incident and Snapshot Facts

A fact table stores some kind of measurements and are captured against a specific time. Now it might so happen that the business might not able to capture all of its measures always for every point in time.

Then those unavailable measurements can be kept empty (Null) or can be filled up with the last available measurements. The first case is the example of incident fact and the second one is the example of snapshot fact.

Cumulative Fact

This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day

Fact-less Fact Tables

A fact table that does not contain any measure is a fact-less fact table. This table will only contain keys from different dimension tables. This is often used to resolve a many-to-many cardinality issue.

7.What are incident and snapshot facts

  • A fact table stores some kind of measurements.
  • Usually these measurements are stored (or captured) against a specific time and these measurements vary with respect to time.
  • Now it might so happen that the business might not able to capture all of its measures always for every point in time.
  • Then those unavailable measurements can be kept empty (Null) or can be filled up with the last available measurements.
  • The first case is the example of incident fact and the second one is the example of snapshot fact.

8.What is conformed fact?

  • Conformed dimensions are the dimensions which can be used across multiple Data Marts in combination with multiple facts tables accordingly.
  • • A conformed dimension is a dimension that has exactly the same meaning and content when being referred from different fact tables.
  • A conformed dimension can refer to multiple tables in multiple data marts within the same organization.

9. What is a dimension table?

  • contains the context of measurements
  •  the dimensions on which the facts are calculated.
  •  dimensions are groups of hierarchies and descriptors that define the facts.
  • Dimensions represent qualitative data
  •  For example– plan, product, class are all dimensions.
  • A dimension table contains descriptive or textual attributes.
  • For example, product category & product name are the attributes of product dimension
  •  A dimensional table is a collection of hierarchies and categories along which the user can drill down and drill up.
  • it contains only thetextual attributes.
  • which describes the business entities of an enterprise which describes the objects in a fact table.
  • Dimension table has primary key which uniquelly identifies each dimension row.
  • Dimension table is sometimes called as lookup or reference table.
  • The primary key of dimension table is used to associate relationship between fact table which contains foreign key.
  • Dimension tables are normally in de-normalized form because these tables are only used to analyse the data and not used to execute transactions.

The fields in a dimension table is used to complete following 3 important requirement :

  1. Query Constrainting
  2. Grouping /Filtering
  3. Report labeling
  4. The primary functions of the dimensions are as follows:

    • Filtering
    • Grouping
    • Labelling

 

10.What are the different types of dimension?

In a data warehouse model, dimension can be of following types,

  1. Conformed Dimension
  2. Junk Dimension
  3. Degenerated Dimension
  4. Role Playing Dimension

Based on how frequently the data inside a dimension changes, we can further classify dimension as

  1. Unchanging or static dimension (UCD)
  2. Slowly changing dimension (SCD)
  3. Rapidly changing Dimension (RCD)

11.What is a ‘Conformed Dimension’?

  • A conformed dimension is the dimension that is shared across multiple subject area. Consider ‘Customer’ dimension.
  • Both marketing and sales department may use the same customer dimension table in their reports. Similarly, a Time’ or ‘Date’ dimensionwill be shared by different subject areas.
  • Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are said to be conformed.
  • For example, if subscriberdimension is connected to two fact tables – billing and claim then the subscriber dimension would be treated as conformed dimension
  • It might be utilized with different fact tables in a single database or over numerous data marts/warehouses

A Dimension that is utilized as a part of different areas is called as conformed dimension. It might be utilized with different fact tables in a single database or over numerous data marts/warehouses. The Confirmed dimension is the dimension which is connected to two or more fact tables.

Real Life Example:

If Order tableis connected to product fact as well as Order item fact and user needs to fetch data from both tables then Order dimension will be joined to Product as well as Order item table.

  •   certain data may not be appropriate to store in the schema, this data (or attributes) can be stored in a junk dimension.
  • The nature of data of junk dimension is usually Boolean or flag values.
  • A single dimension is formed by lumping a number of small dimensions.
  • Junk dimension has unrelated attributes.
  • A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.)
  • so that those can be removed from other tables and can be junked into an abstract dimension table.
  • These junk dimension attributes might not be related. The only purpose of this table is to store all the combinations of the dimensional attributes which you could not fit into the different dimension tables otherwise.
  • Junk dimensions are often used to implement Rapidly Changing Dimensions in data warehouse.
  • For example, it can be member eligibility flag set as Y’ or ‘N or any other indicator set astrue/false, any specific comments, etc. if we keep all such indicator attributes in the fact table then its size gets increased. So, we combine all such attributes and put in a single dimension table called as junk dimension having unique junk IDs with a possible combination of all the indicator values.
  • It is a dimension table comprising of attributes that don’t have a place in the fact table or in any of the current dimension tables. Generally, these are the properties like flags or indicators.

13.What is degenerated dimension?

A degenerated dimension is a dimension that is derived from fact table and does not have its own dimension table.

A dimension key, such as

  • transaction number,
  • receipt number,
  • Invoice number
  • empno

etc. does not have any more associated attributes and hence can not be designed as a dimension table.

  • A degenerated dimension is a dimension which is not a fact but presents in the fact table as a primary key.
  • It does not have its own dimension table.
  • We can also call it as a single attribute dimension table.

But, instead of keeping it separately in a dimension table and putting an additional join, we put this attribute in the fact table directly as a key.

Since it does not have its own dimension table, it can never act a foreign key in fact table

 

14.What is a role-playing dimension?

  • Dimensions are often reused for multiple applications within the same database with different contextual meaning.
  • For instance, a “Date” dimension can be used for “Date of Sale”, as well as “Date of Delivery”, or “Date of Hire”.

For example, a date dimension can be used for “Date of Claim”, “Billing date” or “Plan Term date”. So, such a dimension will be called as Role playing dimension. The primary key of Date dimension will be associated with multiple foreign keys in the fact table.

SCD-recap11

 

15.Difference between Inmon and Kimball philosophies of data warehousing?

KimballFirst Data Marts–Combined way —Data warehouse

Inmon—First Data warehouse–Later—-Data marts

 

16.Explain Data Mart Vs Data Warehouse?

 

Data warehouse Data mart
Data warehousing is subject oriented, time variant, non-volatile collection of data which is used for creation of transactional reports as well as historical reports. Data Mart is simply a subset of Organization’s Data warehouse
Definition:The Data Warehouse is a large repository of data collected from different organizations or departments within a corporation.

Definition:The data mart is an only subtype of a Data Warehouse. It is designed to meet the need of a certain user group.
Focus:Data warehouse focuses on multiple business areas. Focus:Data mart focuses only on single subject area.
Usage:It helps to take a strategic decision. Usage:The data mart is used to take tactical decisions for growth of business.
Type of system :This is centralized system where one fact is at center surrounded by dimension tables. Type of system :Data mart system is de centralized system
Scope of Business:The Data warehouse will cover all business areas. Scope of Business:The scope of Data mart is within the line of the Business.
Data Model:Data warehouse always follows top-down model Data Model:Data mart always follows bottom-up model.
Data Size:Data warehouse contains all historical data so the database size is large.Approximate size of data warehouse is greater than 100 GB. Data Size:Data mart contains data with only one business area so the size of database is smaller than data warehouse.
Source:Data warehouse data comes from multiple heterogeneous data sources. Source:Data mart data is data of only one business area.Many times it will come from only one data source.
Implementation Time:Data warehouse contains all data which will come from multiple data sources. It will take time to build data warehouse. The Time to build data warehouse is months to years. Implementation Time:Data mart is small data warehouse which will contain the data of only a single business area. The implementation time to build data mart is in months.

 

17

18.Compare Database & Data Warehouse
Criteria Database/RDBMS Schema/OLTP  Data Warehouse
Type of data Relational or object-oriented data

data oriented ” (ER model)

Large volume with multiple data types

Subject oriented “(Dimensional Model) .

Data operations Transaction processing Data modeling and analysis
Dimensions of data Two dimensional

OLTP

 Highly Normalized

Multi-dimensional

OLAP

De-normalized

Data design ER based and application-oriented Star/Snowflake schema and subject-oriented
Size of data Small ( in GB) Large ( in TB)
Functionality High availability & performance

  • Difficult to extract and solve complex problems
  • inser,update 
High flexibility and user autonomy

  • Relatively easier in extracting the data and solving complex problems
  • The data warehouse supports dimensional modeling which is a design technique to support end-user queries.

19.What is the difference between OLTP and OLAP?

 

 

 

Database Design
OLTP: Normalized small database. Speed will be not an issue due to smaller database and normalization will not degrade performance.
This adopts entity relationship(ER) model and an application-oriented database design.
OLAP: De-normalized large database. Speed is issue due to larger database and de-normalizing will improve performance as there will be lesser tables to scan while performing tasks.
This adopts star, snowflake or fact constellation mode of subject-oriented database design.

 

 

OLAP OLTP
Data Storage:It stores only historical data and historical data processing is done. Data Storage:It involves daily processing of data
Users Of System: OLAP System is used by higher management like managers , analysts , executives,CEO, CFOs

market-oriented

Multi-dimensional views of business activities of planning and decision making

Users Of System: OLTP system used by DBAs, Database Professionals ,clerks,clients,Programmers for applying business logic.

customer-oriented

Snapshot of business processeswhich does fundamental business tasks 

Key use: OLAP is used to analyse the business Key use: OLTP is used to run the business
Optimization Techniques: OLAP is very huge database so lot of indexes are used for fast data processing Optimization Techniques: OLTP uses less indexing as data is less here
Database Schema: OLAP uses Star-Schema,Snowflakes schema or Fact-Dimensions Database Schema: OLTP uses Entity Relations
Data Information: OLAP contains historical data

Consolidationdata is from various sources.

Data Information: OLTP contains Current data

Operational data is from original data source of the data

Join Data: OLAP has less joins and in

de-normalized form database

Complex long running queriesby system to update the aggregated data.

Join Data: OLTP has large no of joins and in normalizedform

Simple quick running queries ran by users.

Aggregated Data: OLAP system has aggregated multidimensional data Aggregated Data: OLTP has not aggregated data.
Summarized data: OLAP system gives Summarized consolidated data

different levels of granularity

Detailed Data: OLTP system gives data in detailed format
Data Size: OLAP database size is 100 GB to 100 TB Data Size: OLTP database size is 100 MB to 100 GB

20

65808331

21.What is Normalization, First Normal Form, Second Normal Form ,Third Normal Form?
 Normalization is process for assigning attributes to entities
  • Reduces data redundancies
  • Helps eliminate data anomalies
  • Produces controlled redundancies to link tables
Normalization is the analysis of functional dependency between attributes /data items of user views,It reduces a complex user view to a set of small and stable subgroups of fields / relations 
INF:
Repeating groups must be eliminated,
Dependencies can be identified,
All key attributes defined,
No repeating groups in table 
2NF:
The Table is already in INF
Includes no partial dependencies–
No attribute dependent on a portion of primary key,
Still possible to exhibit transitive dependency,Attributes may be functionally dependent on non-key attributes
3NF:
The Table is already in 2NF,
Contains no transitive dependencies.
normalization

images

Normalization

 

slide_6