DATA MODELING-2

DATA WAREHOUSING

 OLTP 

OLAP

Basics

Data Reconciliation

Data Retention

Data Integration

DATA Load

NORMALIZATION

 


 

1.Difference between Inmon and Kimball philosophies of data warehousing?

Kimball (bottom-up): build the data marts first, then combine them to form the data warehouse.

Inmon (top-down): build the data warehouse first, then derive the data marts from it.

 

2.What is a data warehouse?
  • A repository of integrated information, extracted from heterogeneous sources.
  • Time-variant, non-volatile, integrated and subject-oriented.
  • Supports decision making.
  • Holds historical data for the purpose of data analytics.
  • A Data Warehouse can be defined as a centralized, consistent data store or Decision Support System (OLAP) for end business users, used for analysis, prediction and decision making.
  • Data from various enterprise-wide application/transactional source systems (OLTP) is extracted, cleansed, integrated, transformed and loaded into the Data Warehouse.

3.Properties of a Data Warehouse

Subject-oriented

Based on the major subject areas of the business, rather than the major application areas of the enterprise. The ultimate necessity is to store decision-support data rather than application-oriented data. Subject-oriented data warehouses are those that store data around a particular “subject” such as customer, sales or product, among others.

Integrated

The data in Data Warehouse comes from different enterprise-wide application source systems. The source data is often inconsistent in nature. During the Integration process, source data must be made consistent in Data Warehouse so that the data becomes homogeneous and uniform.

Time-variant

The source data in the Data Warehouse is only accurate and valid at some point in time or over some time interval.

Non-volatile

Data in the Data Warehouse is not updated or replaced in real time but is refreshed from operational systems on a regular basis. New data is always added incrementally, integrating it with the previous data set.

  • It is separate from the operational database.
  • Stores a huge amount of data, more historical than current.
  • Does not require data to be highly accurate.
  • Queries are generally complex.
  • Higher management executes statistical queries to take strategic as well as tactical decisions using historical or current data.
  • These systems are therefore called Online Analytical Processing (OLAP) systems.
  • A data warehouse is a database, separate from the operational database, which also stores historical information.
  • The data warehouse database contains transactional as well as analytical data.
  • Used to see current trends in order to run the business.
  • Used for reporting and data analysis purposes.
  • Part of the data can be denormalized so that it is simplified and query performance improves.
  • The data load is controlled.
  • Ad hoc queries and planned queries are very common when it comes to data extraction.

The following activities are involved in a Data Warehouse:

  • Retrieving the data
  • Analyzing the data
  • Extracting the data (sometimes customized extraction is also applicable)
  • Loading the data
  • Managing and transforming the data

 

4.Why do we go for a Data Warehouse instead of OLTP reporting and analysis?

Why should you put your data warehouse on a different system than your OLTP system?

Why are OLTP database designs not generally a good idea for a Data Warehouse?
  • Tables are normalized in OLTP, hence query response will be slow for the end user.
  • OLAP reporting requires historical data, whereas operational databases do not maintain history and simply update the transactions.
  • Data in the Data Warehouse is loaded after consolidation, integration, aggregation and summarization of data from various heterogeneous OLTP source systems; such reporting cannot be generated from OLTP alone.
  • Also, the data in the different source systems uses inconsistent data representations, codes and formats, which have to be reconciled before loading into the Data Warehouse.
  • Finally, complex OLAP queries will obviously degrade the performance of the operational/transactional source systems.
  • OLTP databases are designed and tuned mainly to support insert and update operations.
  • OLTP is data-oriented (ER model); OLAP is subject-oriented (dimensional model).
  • That is why we design a separate, subject-oriented OLAP system. Moreover, a complex query fired on an OLTP system causes heavy overhead on the OLTP server, which directly affects the day-to-day business (see the query sketch below).
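For illustration, a hedged sketch of the kind of analytical query a warehouse star schema answers efficiently; the table and column names (sales_fact, date_dim, product_dim) are hypothetical:

  -- Yearly revenue by product category: one large fact table joined
  -- to two small dimension tables holding pre-integrated data.
  SELECT d.calendar_year,
         p.product_category,
         SUM(f.sales_amount) AS total_revenue
  FROM   sales_fact  f
  JOIN   date_dim    d ON f.date_key    = d.date_key
  JOIN   product_dim p ON f.product_key = p.product_key
  GROUP BY d.calendar_year, p.product_category;

On a fully normalized OLTP schema the same question would typically need many more joins and would compete with the transactional workload for resources.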

5.Benefits of the Data Warehouse implementation:

 

  • It helps us maintain a copy of the information from different source transaction systems.
  • It helps gather data from different systems into one place.
  • Helps in maintaining data history, which most of the time cannot be maintained in the source systems.
  • Improves data quality, as the data is cleansed and transformed so that it matches exactly the way the data is needed.
  • Helps in data reconstruction, so that the raw data that has been gathered makes sense to the relevant users.
  • More importantly, it saves time.
  • Generates high ROI, because the data gathered is helpful to business users who act on the feedback from data analysis.
  • Facilitates reporting on different key business processes of an organization, known as KPIs. Today this whole process of reporting data from data warehouses is often called “Data Analytics“.
  • Helps integrate data from different sources and show single-point-of-truth values for business measures (e.g. enabling Master Data Management).
  • The data warehouse can further be used for data mining, which helps with trend prediction, forecasting, pattern recognition, etc.
  • Faster data retrieval
  • Better understandability
  • Extensibility

6.What are different types of Data warehouse Systems?

  • Enterprise Data Warehouse (EDW)
  • Operational Data Store (ODS)
  • Data Mart

 

 7.Applications of a data warehousing system?

  • Consistency and quality of data: a data warehouse system gives various industries a consistent, cost-effective way of collecting their customer data from various resources.
  • Cost reduction: a data warehouse system reduces cost by storing all electronically collected data in a single data warehouse.
  • Accessibility: with a data warehouse system, data can be accessed easily from time to time for business improvement and reporting.

Data warehouse Architecture

As discussed before, a data warehouse helps business executives to

  • organize,
  • analyze, and
  • use their data for decision making.
  • Tuning Production Strategies − Product strategies can be well tuned by repositioning the products and managing the product portfolios by comparing sales quarterly or yearly.
  • Customer Analysis − Customer analysis is done by analyzing the customer’s buying preferences, buying time, budget cycles, etc.
  • Operations Analysis − Data warehousing also helps in customer relationship management and making environmental corrections. The information also allows us to analyze business operations.
  • Information Processing − A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts or graphs.
  • Analytical Processing − A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up, and pivoting.
  • Data Mining − Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.

A data warehouse serves as a sole part of a plan-execute-assess “closed-loop” feedback system for the enterprise management. Data warehouses are widely used in the following fields −

  • Financial services
  • Banking services
  • Consumer goods
  • Retail sectors
  • Controlled manufacturing

 

.List any five applications of data warehouse.

 Some applications include

  • financial services
  • banking services
  • consumer goods
  • retail sectors
  • controlled manufacturing

8.Explain Data Mart Vs Data Warehouse?

 

Data Warehouse vs Data Mart

  • Definition: A Data Warehouse is a subject-oriented, time-variant, non-volatile collection of data — a large repository collected from different organizations or departments within a corporation — used for transactional as well as historical reporting. A Data Mart is simply a subset of the organization’s Data Warehouse, designed to meet the needs of a certain user group.
  • Focus: The data warehouse focuses on multiple business areas; a data mart focuses on a single subject area.
  • Usage: The data warehouse helps in taking strategic decisions; the data mart is used to take tactical decisions for the growth of the business.
  • Type of system: The data warehouse is a centralized system in which one fact table sits at the center surrounded by dimension tables; a data mart is a decentralized system.
  • Scope of business: The data warehouse covers all business areas; the scope of a data mart is within one line of business.
  • Data model: The data warehouse follows a top-down model; a data mart follows a bottom-up model.
  • Data size: The data warehouse contains all historical data, so the database size is large (typically greater than 100 GB); a data mart contains data for only one business area, so its size is smaller.
  • Source: Data warehouse data comes from multiple heterogeneous data sources; data mart data covers only one business area and often comes from a single data source.
  • Implementation time: The data warehouse integrates data from multiple sources, so building it takes months to years; a data mart is a small data warehouse covering a single business area and can be built in months.

 

9.What is data mart?

  • designed for a single subject area.
  • An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in data warehouse and each department may have separate data marts.
  • These data marts can be built on top of the data warehouse.
  • A data mart (DM) is a specialized version of a data warehouse (DW)
  • Like data warehouses, data marts contain a snapshot of operational data that helps business people to strategize based on analyses of past trends and experiences.
  • The key difference is that the creation of a data mart is predicated on a specific, predefined need for a certain grouping and configuration of select data.

Two types of Data Marts:

  • Independent data marts: sourced from data captured from OLTP systems, external providers, or data generated locally within a particular department or geographic area.
  • Dependent data marts: sourced directly from enterprise data warehouses.

 

10.What is data marting? Explain the different kinds of costs associated?

Data marting is also referred to as building a “data mart”.
It is the process of redefining the information about a specific data set so that it makes sense to a particular group.

costs associated with data marting are as follows:

  • Hardware related costs
  • Software related costs
  • Network access-related costs
  • Time costs

11.Why Do We Need a Data Mart?

  • To partition data in order to impose access control strategies.
  • To speed up the queries by reducing the volume of data to be scanned.
  • To segment data into different hardware platforms.
  • To structure data in a form suitable for a user access tool.

 

12.What is ODS?

 

  • ODS is abbreviated as Operational Data Store and it is a repository of real-time operational data rather than long-term trend data.
  • A collection of operational or base data that is extracted from operational databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture.
  • An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse.
  • The ODS may also be used to audit the data warehouse to assure that summarized and derived data is calculated properly.
  • The ODS may further become the enterprise’s shared operational database, allowing operational systems that are being re-engineered to use the ODS as their operational database.
  • An operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data.
  • Unlike a master data store, the data is not sent back to operational systems.
  • It may be passed on for further operations and to the data warehouse for reporting.
  • In the ODS, data can be scrubbed, resolved for redundancy and checked for compliance with the corresponding business rules.
  • This data store can be used for integrating disparate data from multiple sources so that business operations, analysis and reporting can be carried out while business operations occur.
  • This is the place where most of the data used in current operations is housed before it is transferred to the data warehouse for longer-term storage or archiving.
  • An ODS is designed for relatively simple queries on small amounts of data (such as finding the status of a customer order), rather than the complex queries on large amounts of data typical of the data warehouse.
  • An ODS is similar to short-term memory in that it only stores very recent information; the data warehouse is more like long-term memory, storing relatively permanent information.

 

13.What are operational DBMS?

  • They generally use the E-R data model.
  •  It is used to store transactional data.
  •  The information content is generally recent.
  •  These are thus called as OLTP systems.
  •  Their goals are data accuracy & consistency , Concurrency , Recoverability, Reliability (ACID Properties).
14.Compare Database & Data Warehouse
Criteria: Database/RDBMS schema (OLTP) vs Data Warehouse

  • Type of data: relational or object-oriented data, data-oriented (ER model) vs large volumes with multiple data types, subject-oriented (dimensional model).
  • Data operations: transaction processing vs data modeling and analysis.
  • Dimensions of data: two-dimensional, OLTP, highly normalized vs multi-dimensional, OLAP, de-normalized.
  • Data design: ER-based and application-oriented vs star/snowflake schema and subject-oriented (a minimal star-schema sketch follows below).
  • Size of data: small (in GB) vs large (in TB).
  • Functionality: high availability and performance, mainly inserts and updates, difficult to extract data and solve complex problems vs high flexibility and user autonomy, relatively easy extraction of data and solving of complex problems; the data warehouse supports dimensional modeling, which is a design technique to support end-user queries.
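A minimal star-schema sketch in SQL, illustrating the subject-oriented, denormalized design mentioned above; all table and column names are hypothetical:

  -- Dimension tables: descriptive, denormalized attributes.
  CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    calendar_year INTEGER,
    month_name    VARCHAR(20)
  );

  CREATE TABLE product_dim (
    product_key      INTEGER PRIMARY KEY,
    product_name     VARCHAR(100),
    product_category VARCHAR(50)
  );

  -- Fact table: foreign keys to the dimensions plus numeric measures.
  CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES date_dim(date_key),
    product_key   INTEGER REFERENCES product_dim(product_key),
    sales_amount  DECIMAL(12,2),
    quantity_sold INTEGER
  );

End-user queries then join the central fact table to whichever dimensions the analysis needs.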

15.Types of data warehouse architectures:

  • There are mainly 3 types of data warehouse architectures:
 
Single-tier architecture
  • The objective of a single layer is to minimize the amount of data stored by removing data redundancy.
  • It is not frequently used in practice.
Two-tier architecture
  • This architecture separates physically available sources from the data warehouse.
  • This architecture is not expandable & does not support a large number of end-users.
  • Because of network limitations, this architecture faces connectivity issues.
Three-tier architecture
  • It is the most widely used architecture and consists of a Top, Middle and Bottom tier.
  • Bottom Tier: usually a relational database serves as the bottom tier of the data warehouse, where data is cleansed, transformed, and loaded.
  • Middle Tier: this application tier is an OLAP server; it presents an abstracted view of the database and acts as a mediator between the end user and the database.
  • Top Tier: the top tier is the front-end client layer that channels data out of the data warehouse.

16.Functions of Data Warehouse Tools and Utilities

  • Data Extraction − Involves gathering data from multiple heterogeneous sources.
  • Data Cleaning − Involves finding and correcting the errors in data.
  • Data Transformation − Involves converting the data from legacy format to warehouse format.
  • Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
  • Refreshing − Involves updating from data sources to warehouse.
  • operation of a data warehouse or data mart
    •   including meta data management,
    • data acquisition,
    • data cleansing,
    • data transformation,
    • storage management,
    • data distribution,
    • data archiving,
    • operational reporting,
    • analytical reporting,
    • security management,
    • backup/recovery planning, etc.

Note − Data cleaning and data transformation are important steps in improving the quality of data and data mining results.

 

17.What is snapshot with reference to data warehouse?

  • Snapshot refers to a complete visualization of data at the time of extraction.
  • It occupies less space and can be used to back up and restore data quickly.
  • A snapshot is a process of knowing about the activities performed. It is stored in a report format from a specific catalog. The report is generated soon after the catalog is disconnected.

18.What is real time data-warehousing?
 Real-time data warehousing is a combination of two things:
  • real-time activity
    • Real-time activity is activity that is happening right now.
    • The activity could be anything such as the sale of widgets.
    • Once the activity is complete, there is data about it.
  • data warehousing.
    • Data warehousing captures business activity data.
    • Real-time data  warehousing  captures business activity data as it occurs.
    • As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly.
    • In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.
19. What is Virtual Data Warehousing?
  • Virtual data warehousing is a ‘de facto’ information system strategy for supporting analytical decision making. It is one of the best ways of translating raw data and presenting it in a form that can be used by decision makers. It provides a semantic map which allows the end user to view the data as virtualized.
  • A virtual data warehouse provides a collective view of the completed data.
  • A virtual data warehouse has no historic data.
  • It can be considered as a logical data model containing metadata.
20.What is active data warehousing?
  • An active data warehouse represents a single state of the business. Active data warehousing considers the analytic perspectives of customers and suppliers. It helps to deliver the updated data through reports.
  • A form of repository of captured transactional data is known as ‘active data warehousing’. Using this concept, trends and patterns are found and used for future decision making.
  • An active data warehouse can integrate data changes while scheduled refresh cycles run.

21.What are fundamental stages of Data Warehousing 


  • Offline Operational Database
  • Offline Data Warehouse
  • Real Time Datawarehouse
  • Integrated Datawarehouse
Offline Operational Databases – Data warehouses in this initial stage are developed by simply copying the database of an operational system to an offline server where the processing load of reporting does not impact the operational system’s performance.
Offline Data Warehouse – Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.
Real Time Data Warehouse – Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
Integrated Data Warehouse – Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

22.What are the steps to build the datawarehouse?


  • Gathering business requirements
  • Identifying the necessary sources
  • Identifying the facts
  • Defining the dimensions
  • Defining the attributes
  • Redefine the dimensions and attributes if required
  • Organize the Attribute hierarchy
  • Define Relationships
  • Assign unique Identifiers
  • Additional conventions: Cardinality / adding ratios
.List the phases involved in the data warehouse delivery process.

The stages are:

  • IT Strategy
  • Education
  • Business Case Analysis
  • Technical Blueprint
  • Build the Version
  • History Load
  • Ad hoc Query
  • Requirement Evolution
  • Automation
  • Extending Scope
23.What is the purpose of cluster analysis in Data Warehousing?
  • Cluster analysis is used to define objects without giving them class labels.
  • It analyzes all the data that is present in the data warehouse and compares newly formed clusters with existing clusters.
  • It performs the task of assigning a set of objects into groups, also known as clusters.
  • It is used to perform the data mining job using techniques like statistical data analysis. It draws on information and methods from many fields, such as machine learning, pattern recognition, image analysis and bio-informatics.
  • Cluster analysis is an iterative process of knowledge discovery and involves trials and failures.
  • It is used with pre-processing and other parameters in order to achieve the properties that are desired.

24.In which phase of the Active Data Warehouse evolution do you use data to determine what will happen?

In the predicting phase. Analysts utilize the system to leverage information to predict what will happen next in the business and to proactively manage the organization’s strategy. This stage requires data mining tools and building predictive models using historical detail. As an example, users can model customer demographics for target marketing.

25.What is the difference between Data warehousing and Business Intelligence
 Data warehousing deals with all aspects of managing the 
  • development,
  • implementation
  • operation of a data warehouse or data mart
    •   including meta data management,
    • data acquisition,
    • data cleansing,
    • data transformation,
    • storage management,
    • data distribution,
    • data archiving,
    • operational reporting,
    • analytical reporting,
    • security management,
    • backup/recovery planning, etc.
Business intelligence, on the other hand, is a set of software tools that enable an organization to
  • analyze measurable aspects of their business such as
  • sales performance, profitability, operational 
  • efficiency,
  • effectiveness of marketing campaigns,
  • market penetration among certain customer groups,
  • cost trends,
  • anomalies and exceptions, etc.
Typically, the term “business intelligence” is used to encompass
  • OLAP,
  • data visualization,
  • data mining 
  • query/reporting tools.
26.Why is chameleon method used in data warehousing?
  • Chameleon is a hierarchical clustering algorithm that overcomes the limitations of the existing models and methods present in data warehousing.
  • This method operates on a sparse graph in which nodes represent the data items and edges represent the weights (similarities) between the data items.
  • This representation allows large datasets to be created and operated on successfully. The method finds the clusters in the dataset using a two-phase algorithm.
    • The first phase consists of graph partitioning, which clusters the data items into a large number of sub-clusters.
    • The second phase uses an agglomerative hierarchical clustering algorithm to search for genuine clusters that can be combined together from the sub-clusters produced.

27.What are loops in Datawarehousing?

In data warehousing, loops may exist between tables. If there is a loop between the tables, then query generation will take more time and it creates ambiguity. It is advised to avoid loops between tables.

28.Comparison between Data Warehouse and Mediated Schema approaches

 

Tight-Coupling (data warehouse)
Loose Coupling (mediated schema)
Advantages
  1. Independence (Lesser dependency to source systems since data is physically copied over)
  2. Faster query processing
  3. Complex query processing
  4. Advanced data summarization and storage possible
  5. High Volume data processing
Advantages
  1. Data Freshness (low latency – almost real time)
  2. Higher Agility (when a new source system comes or an existing source system changes, only the corresponding adapter is created or changed, largely not affecting the other parts of the system)
  3. Less costly (a lot of infrastructure cost can be saved since data localization is not required)
Disadvantages
  1. Latency (since data needs to be loaded using ETL)
  2. Costlier (data localization, infrastructure, security)
Disadvantages
  1. Semantic conflicts (the meaning of the measure “net profit” can be different in different systems, so semantic awareness is often necessary in the mediated schema*)
  2. Slower query response (due to network/bandwidth issues, non-localization of data, workload on the source systems, etc.)
  3. High dependency on the data sources

* Semantic awareness can be achieved by the process of metadata publishing

 

29.What is the definition of Cube in Datawarehousing?

  • Cubes are logical representation of multidimensional data.
  • The edges of the cube hold the dimension members, and the body of the cube contains the data values.

30.Define data analytics in the context of data warehousing.

  • Data analytics is the science of examining raw data with the purpose of drawing business-driven conclusions about that data.
  • The role of a data warehouse is to enable data analysis.
31.Critical Features of an ETL framework
  • Support for Change Data Capture Or Delta Loading Or Incremental Loading
  • Metadata logging
  • Handling of multiple source formats
  • Restartability support
  • Notification support
  • Highly configurable / customizable

32.Good-to-have features of ETL Framework

  • Inbuilt data reconciliation
  • Customizable log format
  • Dry-load enabling
  • Multiple notification formats

.Top 5 Challenges of Data Warehousing

  • Ensuring Acceptable Data Quality
    • Disparate data sources add to data inconsistency
    • Unstable source systems
  • Ensuring acceptable Performance
    • Prioritizing performance
    • Setting realistic goals
    • Performance by design
  • Testing the data warehouse
    • Test planning
    • No automated testing
      • given the lack of standardization in how the metadata are defined and design approaches are followed in different data warehousing projects.
  • Reconciliation of data :
    • Reconciliation is a process of ensuring correctness and consistency of data in a data warehouse. Unlike testing, which is predominantly a part of software development life cycle, reconciliation is a continuous process that needs to be carried out even after the development cycle is over.
    • Reconciliation is complex
  • User Acceptance

DETAILED STUDY :

https://dwbi.org/data-modelling/dw-design/11-top-5-challenges-of-data-warehousing

https://dwbi.org/data-modelling/dw-design/12-top-10-things-you-must-know-before-designing-a-data-warehouse

.What is Decision Support System

 

Decision Support System (DSS) is a class of information systems (including but not limited to computerized systems) that support business and organizational decision-making activities. A properly designed DSS is an interactive software-based system intended to help decision makers compile useful information from a combination of raw data, documents, personal knowledge, or business models to identify and solve problems and make decisions.

Typical information that a decision support application might gather and present :

  • An inventory of all of your current information assets (including legacy and relational data sources, cubes, data warehouses, and data marts)
  • Comparative sales figures between one week and the next
  • Projected revenue figures based on new product sales assumptions etc.

.Components of a Decision Support System

DSS components may be classified as:

  • Inputs: Factors, numbers, and characteristics to analyze. This is basically the raw data that you put to the system.
  • User Knowledge and Expertise: Inputs requiring manual analysis by the user
  • Outputs: Transformed data from which DSS “decisions” are generated
  • Decisions: Results generated by the DSS based on user criteria

Detailed study:

https://dwbi.org/data-modelling/dw-design/7-decision-support-system-dss

 

.List the functions of data warehouse tools and utilities.
The functions performed by data warehouse tools and utilities are:

Data Extraction
Data Cleaning
Data Transformation
Data Loading and
Refreshing
.List the process that are involved in Data Warehousing.
Data Warehousing involves data cleaning, data integration and data consolidation.

.List the schemas that a data warehouse system can implement.
A data warehouse can implement:

star schema
snowflake schema and
fact constellation schema

 

.What is called data cleaning?

Cleaning of orphan records
Data breaching business rules
Inconsistent data
Missing information in a database

.What are the languages used in Data cleansing?

R (programming language), SQL (Structured Query Language), and advanced Excel macros.

OLTP 

33.What is the difference between OLTP and OLAP?

 

 

 

Database Design
OLTP: Normalized, small database. Speed is not an issue because the database is smaller, and normalization does not degrade performance.
This adopts an entity-relationship (ER) model and an application-oriented database design.
OLAP: De-normalized, large database. Speed is an issue because of the larger database; de-normalizing improves performance as there are fewer tables to scan while performing tasks.
This adopts a star, snowflake or fact constellation model and a subject-oriented database design.

 

 

OLAP vs OLTP

  • Data storage: OLAP stores historical data and does historical data processing; OLTP involves daily processing of current data.
  • Users of the system: OLAP systems are used by higher management such as managers, analysts, executives, CEOs and CFOs; they are market-oriented and provide multi-dimensional views of business activities for planning and decision making. OLTP systems are used by DBAs, database professionals, clerks, clients and programmers applying business logic; they are customer-oriented and provide a snapshot of the business processes that perform the fundamental business tasks.
  • Key use: OLAP is used to analyse the business; OLTP is used to run the business.
  • Optimization techniques: OLAP databases are very large, so many indexes are used for fast data processing; OLTP uses less indexing as there is less data.
  • Database schema: OLAP uses star schema, snowflake schema or fact constellation; OLTP uses entity-relationship (ER) schemas.
  • Data information: OLAP contains historical data consolidated from various sources; OLTP contains current, operational data from the original data source.
  • Joins: OLAP has fewer joins and a de-normalized database, with complex long-running queries run by the system to update the aggregated data; OLTP has a large number of joins and a normalized form, with simple quick-running queries run by users.
  • Aggregated data: an OLAP system has aggregated, multidimensional data; OLTP has non-aggregated data.
  • Summarization: an OLAP system gives summarized, consolidated data at different levels of granularity; an OLTP system gives data in detailed format.
  • Data size: OLAP database size is typically 100 GB to 100 TB; OLTP database size is typically 100 MB to 100 GB.

34.What is the difference between ODS and OLTP?
ODS: a collection of tables created in the data warehouse environment that maintains only current data.
The ODS has a broad, enterprise-wide scope, but unlike the real enterprise data warehouse, its data is refreshed in near real time and used for routine business activity.
ODS (Operational Data Source/Store): the source from which we get the data for the DWH is called the ODS.
OLTP maintains data only for transactions; these systems are designed for recording the daily operations and transactions of a business.
OLTP is an online transaction processing system that deals with day-to-day data.
OLTP: the front-end content is associated with the back end. (Note: a fact table without facts, i.e. one with no key measure to analyse the business, is called a factless fact table.)

35.What are two examples of an OLTP environment?

Online banking, and online reservation systems (for transportation such as rail, air, etc.).

Other examples: 1) ATM, 2) POS (point of sale).

 

OLAP

36.What is the difference between OLAP and a data warehouse?

  •  Data warehouse is the place where the data is stored for analyzing 
  • OLAP is the process of analyzing the data,managing aggregations,partitioning information into cubes for in depth visualization.

37.What does OLAP stand for?

OLAP stands for On-Line Analytical Processing. It is a system which collects, manages, and processes multi-dimensional data for analysis and management.

 38.List the types of OLAP servers.

  • Relational OLAP
  • Multidimensional OLAP
  • Hybrid OLAP
  • Specialized SQL Servers

39.List some of the functions performed by OLAP.

  • roll-up
  • drill-down
  • slice
  • dice and
  • pivot
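As an illustration of roll-up, many SQL engines (Oracle, SQL Server, PostgreSQL) support GROUP BY ROLLUP, which aggregates a measure at successively higher levels; the table and column names here are hypothetical:

  -- Sales totals per (year, month), then per year, then a grand total.
  SELECT calendar_year,
         month_name,
         SUM(sales_amount) AS total_sales
  FROM   sales_summary
  GROUP BY ROLLUP (calendar_year, month_name);

Drill-down is the reverse navigation, from the yearly totals back to the monthly (or daily) rows.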

40.What is slicing-dicing?

  • Slicing means showing a slice of the data, given a certain dimension (e.g. Product), a member value (e.g. Brown Bread) and measures (e.g. sales).
  • Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.
  • Slicing and dicing operations are part of pivoting.
  • Only one dimension is selected for the slice operation.
  • For the dice operation, two or more dimensions are selected for a given cube (see the SQL sketch below).
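A hedged SQL sketch of slice and dice against a hypothetical cube-like view (sales_fact_view and its columns are assumptions):

  -- Slice: fix a single dimension member (Product = 'Brown Bread').
  SELECT time_period, region, SUM(sales_amount) AS sales
  FROM   sales_fact_view
  WHERE  product_name = 'Brown Bread'
  GROUP BY time_period, region;

  -- Dice: restrict two or more dimensions to chosen sub-sets.
  SELECT time_period, region, SUM(sales_amount) AS sales
  FROM   sales_fact_view
  WHERE  product_name IN ('Brown Bread', 'White Bread')
  AND    region       IN ('North', 'East')
  GROUP BY time_period, region;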

41.What is drill-through?

  • Drill-through is the process of going from summary data down to the detail-level data behind it.
  • Obtaining the details underlying aggregated data is called drill-through.

 

42.Data Warehousing – Relational OLAP

 

  • ROLAP servers are placed between relational back-end server and client front-end tools.
  • To store and manage warehouse data,
  • ROLAP uses relational or extended-relational DBMS.

ROLAP includes the following −

  • Implementation of aggregation navigation logic.
  • Optimization for each DBMS back end.
  • Additional tools and services.

Points to Remember

  • ROLAP servers are highly scalable.
  • ROLAP tools analyze large volumes of data across multiple dimensions.
  • ROLAP tools store and analyze highly volatile and changeable data.

Relational OLAP Architecture

ROLAP includes the following components −

  • Database server
  • ROLAP server
  • Front-end tool.

(Figure: ROLAP architecture)

Advantages

  • ROLAP servers can be easily used with existing RDBMSs.
  • Data can be stored efficiently, since zero facts need not be stored.
  • ROLAP tools do not use pre-calculated data cubes.
  • The DSS server of MicroStrategy adopts the ROLAP approach.

Disadvantages

  • Poor query performance.
  • Some limitations of scalability depending on the technology architecture that is utilized.

 43.Data Warehousing – Multidimensional OLAP

MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Points to Remember −

  • MOLAP tools process information with consistent response time regardless of level of summarizing or calculations selected.
  • MOLAP tools need to avoid many of the complexities of creating a relational database to store data for analysis.
  • MOLAP tools need fastest possible performance.
  • MOLAP servers adopt two levels of storage representation to handle dense and sparse data sets.
  • Denser sub-cubes are identified and stored as array structure.
  • Sparse sub-cubes employ compression technology.

MOLAP Architecture

MOLAP includes the following components −

  • Database server.
  • MOLAP server.
  • Front-end tool.

(Figure: MOLAP architecture)

Advantages

  • MOLAP allows fastest indexing to the pre-computed summarized data.
  • Helps the users connected to a network who need to analyze larger, less-defined data.
  • Easier to use, therefore MOLAP is suitable for inexperienced users.

Disadvantages

  • MOLAP is not capable of containing detailed data.
  • The storage utilization may be low if the data set is sparse.

44.Hybrid OLAP

  • Hybrid OLAP is a combination of both ROLAP and MOLAP.
  • It offers higher scalability of ROLAP and faster computation of MOLAP.
  • HOLAP servers allow storing large volumes of detailed information.
  • The aggregations are stored separately in MOLAP store.

45.Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

 

46.Which one is faster, Multidimensional OLAP or Relational OLAP?

Multidimensional OLAP is faster than Relational OLAP.

  • MOLAP: Multi-dimensional OLAP. Data is stored in a multidimensional cube. The storage is not in a relational database, but in proprietary formats (one example is Power OLAP’s .olp file). MOLAP products can be compatible with Excel, which can make data interaction easy to learn.
  • ROLAP: Relational OLAP ROLAP products access a relational database by using SQL (structured query language), which is the standard language that is used to define and manipulate data in an RDBMS. Subsequent processing may occur in the RDBMS or within a mid-tier server, which accepts requests from clients, translates them into SQL statements, and passes them on to the RDBMS.

47. MOLAP vs ROLAP

 

MOLAP vs ROLAP

  1. Information retrieval is fast in MOLAP; comparatively slow in ROLAP.
  2. MOLAP uses sparse arrays to store data sets; ROLAP uses relational tables.
  3. MOLAP is best suited for inexperienced users, since it is very easy to use; ROLAP is best suited for experienced users.
  4. MOLAP maintains a separate database for data cubes; ROLAP may not require space other than what is available in the data warehouse.
  5. In MOLAP the DBMS facility is weak; in ROLAP the DBMS facility is strong.

 

 

BASICS

48.How will you do data transformation in data warehouse platform?

Data can be transferred using different database tools, such as MySQL tools or MS Access tools, connected to a server such as SQL Server or Oracle. After setting up the environment, SQL, with the support of a shell scripting language, is used to transfer the data to the data warehouse system.

 

49.What are the techniques used in data transformation?

SQL (Structured Query Language) is mainly used for data transformation. The SELECT command of SQL is used for transformation, and the SCP and SSH commands of the shell scripting language are used to connect to the data warehouse server for data transfer.

 

50.Define Transformation ?

  • Transformation is the core of the reconciliation phase.
  • It converts data from its operational source format into a specific data warehouse format.
  • If you implement a three-layer architecture, this phase outputs your reconciled data layer.

51.What are different stages of Business Intelligence?

  • Data Source
  • Data Analysis
  • Decision making support
  • Situation Awareness
  • Risk Management

52.Why is it Necessary to Partition?

  • For easy management,
  • To assist backup/recovery,
  • To enhance performance.

Horizontal Partitioning

  • Partitioning by Time into Equal Segments
  • Partition by Time into Different-sized Segments
  • Partition on a Different Dimension
  • Partition by Size of Table
  • Partitioning Dimensions

If a dimension contains a large number of entries, then it is required to partition the dimension. Here we have to check the size of the dimension.

Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may be very large. This would definitely affect the response time.
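A hedged sketch of partitioning by time into equal segments, using Oracle-style range partitioning; the table, column and partition names are hypothetical:

  CREATE TABLE sales_fact_part (
    sale_date    DATE,
    product_key  NUMBER,
    sales_amount NUMBER(12,2)
  )
  PARTITION BY RANGE (sale_date) (
    PARTITION sales_2022 VALUES LESS THAN (DATE '2023-01-01'),
    PARTITION sales_2023 VALUES LESS THAN (DATE '2024-01-01'),
    PARTITION sales_2024 VALUES LESS THAN (DATE '2025-01-01')
  );

Queries that filter on sale_date can then scan only the relevant partitions, and old partitions can be archived or dropped independently.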

Round Robin Partitions

In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow the user access tool to refer to the correct table partition.

This technique makes it easy to automate table management facilities within the data warehouse.

Vertical Partition

Vertical partitioning splits the data vertically, i.e. by columns.

(Figure: vertical partitioning)

Vertical partitioning can be performed in the following two ways −

  • Normalization
  • Row Splitting

 

53.What is ETL?

ETL is abbreviated as Extract, Transform and Load. ETL software reads data from a specified data source and extracts a desired subset of data. Next, it transforms the data using rules and lookup tables and converts it to the desired state.

Then, the load function is used to load the resulting data into the target database.

  • ETL is abbreviation of extract, transform, and load.
  • ETL is software that enables businesses to consolidate their disparate data while moving it from place to place, and it doesn’t really matter that data is in different forms or formats.
  • The data can come from any source.
  • ETL is powerful enough to handle such data disparities.
  • First, the extract function reads data from a specified source database and extracts a desired subset of data.
  • Next, the transform function works with the acquired data – using rules or lookup tables, or creating combinations with other data – to convert it to the desired state.
  • Finally, the load function is used to write the resulting data to a target database.
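A minimal, hedged sketch of the transform-and-load step expressed in SQL; the staging and warehouse table names are hypothetical, and in practice an ETL tool generates or orchestrates such logic:

  -- Transform: cleanse and standardize rows from the staging area,
  -- then Load: write the conformed rows into the warehouse table.
  INSERT INTO dw_customer (customer_key, customer_name, gender, country_code)
  SELECT s.customer_id,
         UPPER(TRIM(s.customer_name)),              -- cleanse
         CASE s.gender WHEN 'M' THEN 'Male'
                       WHEN 'F' THEN 'Female'
                       ELSE 'Unknown' END,          -- standardize codes
         COALESCE(s.country_code, 'XX')             -- default missing values
  FROM   stg_customer s;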

54.What is meant by Data Analytics?

  • Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information.
  • A data warehouse is often built to enable Data Analytics

55.What is a table (entity)?

Data stored in the form of rows and columns is called a table. Each column has a datatype and, based on the situation, integrity constraints are enforced on columns.

56.What is a column (attribute)?

A column, also known as a field, is a vertical alignment of the data and contains information related to that column.

57.What is a row?

A row, also known as a tuple or record, is a horizontal alignment of the data.

58.What is a sequence?

A sequence is a database object used to generate unique numbers (see the sketch below).
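For example, using Oracle-style syntax (the sequence and table names are hypothetical):

  CREATE SEQUENCE customer_key_seq START WITH 1 INCREMENT BY 1;

  -- Use the sequence to generate a surrogate key while loading a dimension.
  INSERT INTO customer_dim (customer_key, customer_name)
  VALUES (customer_key_seq.NEXTVAL, 'Acme Ltd');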

59.What are the possible data marts in retail sales?
  • Product information
  • sales information
60.What is data validation strategies for data mart validation after loading process?
  •  Data validation is to make sure that the loaded data is accurate and meets the business requirements .
  • Strategies are different methods followed to meet the validation requirements
61.What is a general purpose scheduling tool?

 

The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.

62.Can we take backup when the database is opened?

No, we cannot take a full backup when the database is open.

 63.What is defined as Partial Backup?

A partial backup in an operating system is any backup short of a full backup, and it can be taken while the database is open or shut down.

 64.What is the goal of Optimizer?

The goal of the optimizer is to find the most efficient way to execute the SQL statements.

65.What is Execution Plan?

An execution plan is the plan used by the optimizer to select the combination of steps for executing a SQL statement.

 66.What is the language that is used for schema definition?

Data Mining Query Language (DMQL) is used for schema definition.

67.What language is the base of DMQL?

DMQL is based on Structured Query Language (SQL).

 .Expand what DMQL stands for and what is the language that is used in DMQL?

DMQL stands for Data Mining Query Language.

  • This language is used for schema definition.
  • The language that is used in DMQL is nothing but SQL language.
  • SQL stands for Structured Query Language.
68.What do you mean by Data Extraction?
Data extraction means gathering data from multiple heterogeneous sources 
69.What is a lookup table?
  • A lookup table is one which is used when updating a warehouse.
  • When the lookup is placed on the target table (fact table / warehouse), based on the primary key of the target, it updates the table by allowing only new records or updated records through, based on the lookup condition (see the MERGE sketch below).
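A hedged sketch of how such a lookup-based load is often written with MERGE (ANSI/Oracle-style syntax; the table names are hypothetical): rows that match on the lookup key are updated, the rest are inserted.

  MERGE INTO customer_dim d
  USING stg_customer s
  ON (d.customer_id = s.customer_id)        -- lookup on the business key
  WHEN MATCHED THEN
    UPDATE SET d.customer_name = s.customer_name,
               d.city          = s.city
  WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, city)
    VALUES (s.customer_id, s.customer_name, s.city);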
70.What type of Indexing mechanism do we need to use for a typical datawarehouse?
  •  On the fact table it is best to use bitmap indexes.
  • Dimension tables can use bitmap and/or the other types of
    • clustered/non-clustered,
    • unique/non-unique indexes.
SQLServer does not support bitmap indexes. Only Oracle supports bitmaps.
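For example, on Oracle (the index and table names are hypothetical):

  -- Bitmap index on a low-cardinality foreign-key column of the fact table.
  CREATE BITMAP INDEX sales_fact_prod_bix ON sales_fact (product_key);

  -- Ordinary B-tree index on a dimension attribute.
  CREATE INDEX product_dim_cat_ix ON product_dim (product_category);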
71.What is a linked cube?
  • A linked cube is one in which a subset of the data can be analysed in great detail.
  • The linking ensures that the data in the cubes remains consistent.
72.What is BUS Schema?
A BUS schema is composed of a master suite of
  • conformed dimensions
  • standardized definitions of facts.
73.What is a BO repository?
A repository means a set of database tables.
  • The Business Objects repository stores
    • security information, e.g. users, groups, access permissions, user types, etc.
    • universe information, e.g. objects, classes, table names, column names, relationships, etc.
    • document information.

74.What are the various ETL tools in the Market?

  • Informatica
  • Data Stage 
  • Oracle Warehouse Builder 
  • Ab Initio
  • Data Junction 
  • Humming Bird Genio,
  • Business Objects Data Integrator
  • MS SQL DTS (SQL Server Integration Services 2005)
  • SQL Loader
  • Sunopsis
 75.What are modeling tools available in the Market?
  • Erwin Computer Associates ,
  • Embarcadero Embarcadero Technologies ,
  • Rational Rose IBM Corporation ,
  • Power Designer Sybase Corporation ,
  • Oracle Designer Oracle Corporation
  • Informatica (Cubes/Dimensions)
  • ERwin (Entity Relationship for Windows)
 76.What are the various Reporting tools in the Market?
  • MS-Excel ,
  •  Business Objects (Crystal Reports) ,
  •  Cognos(Impromptu, Power Play) ,
  •  Microstrategy ,
  •  MS reporting services ,
  • Informatica Power Analyzer ,
  •  Actuate ,
  •  Hyperion (BRIO) ,
  •  Oracle Express OLAP ,
  •  Proclarity

 

77.What is data mining?
  • Data mining is a process of extracting hidden trends within a data warehouse.
  • For example, an insurance data warehouse can be used to mine data for the highest-risk people to insure in a certain geographical area.
  • Data mining is said to be a process of analyzing the data in different dimensions or perspectives and summarizing it into useful information, which can be queried and retrieved from the database in the required format.

78.What are the approaches used by Optimizer during execution plan?

There are two approaches:

  1. Rule Based
  2. Cost Based

79.What needs to be done while starting the database?

Following need to be done to start the database:

  1. Start an Instance
  2. Mount the database
  3. Open the database

80.What needs to be done when the database is shutdown?

Following needs to be done when the database is shutdown:

  1. Close the database
  2. Dismount the database
  3. Shutdown the Instance

81.What is summary information?

Summary Information is the location within data warehouse where predefined aggregations are stored

The primary functions of the dimensions are as follows:

  • Filtering
  • Grouping
  • Labelling

82.What is the difference between View and Materialized View?

View:

– A view provides a tailored (virtual) representation of the data, used to access data from its base tables.

– It has a logical structure and does not occupy storage space.

– Changes made through the view affect the corresponding base tables.

Materialized view:

– Pre-calculated data persists in a materialized view.

– It occupies physical data space.

– Changes made to it will not affect the corresponding base tables; the materialized view must be refreshed to pick up new data (see the DDL sketch below).
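For example, using Oracle-style syntax (object names are hypothetical):

  -- View: a stored query; no data of its own.
  CREATE VIEW v_active_customers AS
  SELECT customer_key, customer_name
  FROM   customer_dim
  WHERE  status = 'ACTIVE';

  -- Materialized view: the result set is physically stored and refreshed.
  CREATE MATERIALIZED VIEW mv_monthly_sales
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  AS
  SELECT date_key, SUM(sales_amount) AS total_sales
  FROM   sales_fact
  GROUP BY date_key;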

 

 

83.Common aggregate functions include :

  • Average (i.e., arithmetic mean)
  • Count
  • Maximum
  • Median
  • Minimum
  • Mode
  • Range
  • Sum
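Most of these map directly to SQL aggregate functions (median, mode and range usually need extra functions or expressions); a small hedged example over a hypothetical fact table:

  SELECT AVG(sales_amount)                      AS avg_sale,
         COUNT(*)                               AS sale_count,
         MAX(sales_amount)                      AS max_sale,
         MIN(sales_amount)                      AS min_sale,
         MAX(sales_amount) - MIN(sales_amount)  AS sale_range,
         SUM(sales_amount)                      AS total_sales
  FROM   sales_fact;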

84.Purpose of cluster analysis :-

  • Scalability
  • Ability to deal with different kinds of attributes
  • Discovery of clusters with arbitrary shape
  • High dimensionality
  • Ability to deal with noisy data
  • Interpretability

 

 85.What is the difference between agglomerative and divisive Hierarchical Clustering?
  • The agglomerative hierarchical clustering method allows the clusters to be read from bottom to top, so that the program always reads the sub-components first and then moves to the parent, whereas
  • divisive hierarchical clustering uses a top-down approach in which the parent is visited first and then the children.
  • The agglomerative hierarchical method starts with each object forming its own cluster, and these clusters are grouped together to create larger clusters. It is a process of continuous merging until all the single clusters are merged into one complete big cluster consisting of all the objects of the child clusters.
  • In divisive clustering, the parent cluster is divided into smaller clusters, and the division continues until each cluster holds a single object.

 

86.How can we build data marts from FSLDM? Is there any standard approach for this? Does FSLDM support it?

Teradata is like all other DBMSs; we can create as many databases as required. Data marts are basically subject-oriented and mainly based on the business KPIs which end users generally like to measure. You have to refer to the business requirements document when designing a data mart.

 

87.Classifying data for successful modeling

Qualitative Data

Qualitative data are also called categorical data, as they represent distinct categories rather than numbers. In dimensional modeling, they are often termed “dimensions“. Mathematical operations such as addition or subtraction do not make any sense on such data.

Examples of qualitative data are eye color, zip code, phone number, etc.

Qualitative data can be further classified into below classes:

NOMINAL :

Nominal data represents data where order of the data does not represent any meaningful information.

Consider your passport number. There is no information as such in whether your passport number is greater or lesser than someone else’s passport number. Consider the eye color of people: it does not matter in which order we represent the eye colors; order does not matter.

ID, ZIP code, phone number, eye color, etc. are examples of the nominal class of qualitative data.

ORDINAL :

Order of the data is important for ordinal data.

Consider the height of people – tall, medium, short. Although these are qualitative, the order of the attributes does matter, in the sense that they represent some comparative information. Similarly, letter grades, a scale of 1-10, etc. are examples of ordinal data.

In the field of dimensional modeling, this kind of data is sometimes referred to as non-additive facts.

Quantitative data

Quantitative data are also called numeric data, as they represent numbers. In the dimensional data modeling approach, these data are termed “measures“.

Examples of quantitative data are the height of a person, the amount of goods sold, revenue, etc.

Quantitative attributes can be further classified as below.

INTERVAL :

The interval classification is used where there is no true zero point in the data and the division operation does not make sense.

Bank balance, temperature on the Celsius scale, GRE score, etc. are examples of interval-class data. Dividing one GRE score by another GRE score does not make any sense. In dimensional modeling this is synonymous with semi-additive facts.

RATIO :

The ratio class applies to data that has a true “zero” and where division does make sense. Consider revenue, length of time, etc. These measures are generally additive.

The table below illustrates the actions that are possible on each data type:

  • Nominal: distinct
  • Ordinal: distinct, order
  • Interval: distinct, order, addition
  • Ratio: distinct, order, addition, multiplication

Data Reconciliation

88.Data Reconciliation

  • Many of the data warehouses are built on n-tier architecture with multiple data extraction and data insertion jobs between two consecutive tiers.
  • As it happens, the nature of the data changes as it passes from one tier to the next tier.
  • Data reconciliation is the method of reconciling, or tying up, the data between any two consecutive tiers (layers).

.Why Reconciliation is required?

  • In the process of extracting data from one source and then transforming the data and loading it to the next layer, the whole nature of the data can change considerably.
  • It might also happen that some information is lost while transforming the data.
  • A reconciliation process helps to identify such loss of information.

One of the major reasons of information loss is loading failures or errors during loading. Such errors can occur due to several reasons e.g.

  • Inconsistent or non coherent data from source
  • Non-integrating data among different sources
  • Unclean / non-profiled data
  • Un-handled exceptions
  • Constraint violations
  • Logical issues / Inherent flaws in program
  • Technical failures like loss of connectivity, loss over network, space issue etc.

Failure due to any such issue can result in potential information loss, leading to unreliable data quality for business process decision making.

Furthermore, if such issues are not rectified at the earliest opportunity, they become even more costly to “patch” later. Therefore it is highly suggested that a proper data reconciliation process be in place in any data Extraction-Transformation-Load (ETL) process.

.Methods of Data Reconciliation

Master Data Reconciliation

Master data reconciliation is the method of reconciling only the master data between source and target.

  • Master data are generally unchanging or slowly changing in nature and no aggregation operation is done on the dataset.
  • That is – the granularity of the data remains same in both source and target.
  • That is why master data reconciliation is often relatively easy and quicker to implement.

In one business process, “customer”, “products”, “employee” etc. are some good example of master data.

Ensuring that the total number of customers in the source systems matches exactly the total number of customers in the target system is an example of customer master data reconciliation.

Some of the common examples of master data reconciliation can be the following measures,

  1. Total count of rows, example
    • Total Customer in source and target
    • Total number of Products in source and target etc.
  2. Total count of rows based on a condition, example
    • Total number of active customers
    • Total number of inactive customers etc.
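A hedged sketch of such a master data reconciliation check in SQL, assuming the source is reachable through a link or a staging copy (table names are hypothetical):

  -- Compare total customer counts between source and target.
  SELECT 'SOURCE' AS side, COUNT(*) AS total_customers FROM src_customer
  UNION ALL
  SELECT 'TARGET' AS side, COUNT(*) AS total_customers FROM dw_customer_dim;

The two counts are expected to match; any difference flags a loading issue.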

Transactional Data Reconciliation

  • Sales quantity, revenue, tax amount, service usage etc. are examples of transactional data.
  • Transactional data make the very base of BI reports so any mismatch in transactional data can cause direct impact on the reliability of the report and the whole BI system in general.
  • That is why a reconciliation mechanism must be in place in order to detect such a discrepancy beforehand (meaning, before the data reaches the final business users).
  • Transactional data reconciliation is always done in terms of total sum.
  • This prevents any mismatch otherwise caused due to varying granularity of qualifying dimensions.
  • Also this total sum can be done on either full data or only on incremental data set.

Some examples measures used for transactional data reconciliation can be

  1. Sum of total revenue calculated from source and target
  2. Sum of total product sold calculated from source and target etc.
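A hedged sketch of such a transactional reconciliation measure, comparing total revenue between source and target (table names are hypothetical):

  SELECT 'SOURCE' AS side, SUM(sales_amount) AS total_revenue FROM src_sales
  UNION ALL
  SELECT 'TARGET' AS side, SUM(sales_amount) AS total_revenue FROM dw_sales_fact;

Because totals are compared as sums, differing granularity between the two systems does not cause a false mismatch.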

Automated Data Reconciliation

  • For large warehouse systems, it is often convenient to automate the data reconciliation process by making this an integral part of data loading.
  • This can be done by maintaining separate loading metadata tables and populating those tables with reconciliation queries.
  • The existing reporting architecture of the warehouse can be then used to generate and publish reconciliation reports at the end of the loading.
  • Such automated reconciliation will keep all the stakeholders informed about the trustworthiness of the reports. A minimal sketch of such a loading-metadata table is shown below.
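
The sketch below assumes a hypothetical reconciliation-log table populated at the end of each load; all names are illustrative, not part of any specific product:

-- A minimal reconciliation-metadata table.
CREATE TABLE recon_log (
    batch_id      INTEGER,
    check_name    VARCHAR(100),    -- e.g. 'CUSTOMER_COUNT', 'TOTAL_REVENUE'
    source_value  NUMERIC(18,2),
    target_value  NUMERIC(18,2),
    run_timestamp TIMESTAMP
);

-- Record the customer-count check for the current batch
-- (on Oracle, append "FROM dual" to the SELECT).
INSERT INTO recon_log (batch_id, check_name, source_value, target_value, run_timestamp)
SELECT 101, 'CUSTOMER_COUNT',
       (SELECT COUNT(*) FROM src_customer),
       (SELECT COUNT(*) FROM dw_customer_dim),
       CURRENT_TIMESTAMP;

The existing reporting layer can then query recon_log to publish the reconciliation report after each load.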

.Scope of Data Reconciliation

  • Data reconciliation is often confused with the process of data quality testing.
  • Even worse, sometimes the data reconciliation process is used to investigate and pinpoint data issues.
  • While data reconciliation may be a part of data quality assurance, these two things are not necessarily the same.
  • The scope of data reconciliation should be limited to identifying whether or not there is any issue in the data.
  • The scope should not be extended to automate the process of data investigation and pinpointing the issues.
  • A successful reconciliation process should only indicate whether or not the data is correct. It will not indicate why the data is not correct. The reconciliation process answers the “what” part of the question, not the “why” part.

 

 89.What is volumetric misjudgement?

Even a very minutely done volumetric estimate in the starting phase of the project can go awry later. This happens for several reasons, e.g. a slight change in the standard business metrics may create a huge impact on the volumetric estimates.

For example, suppose a company has 1 million customers who are expected to grow at a rate of 7% per annum. While calculating the volume and size of your data warehouse you have used this measure in several places. Now if the customer base actually increases by 10% instead of 7%, that means 30,000 more customers than estimated. In a fact table with a granularity of customer, product, day – this would mean 30000 X 10 X 365 more records (assuming on average one customer uses 10 products). If one record takes 1 KB, the fact table would now require (30000 X 365 X 10 X 1 KB) / (1024 X 1024) ≈ 104 GB of additional disk space from this one table alone.

Data Retention

90.What is data retention and Purging?

There is often a requirement to purge / archive / delete the data in a data warehouse after a certain period of time, termed the retention period of the data warehouse. Once the retention period is reached, data from the data warehouse is purged, deleted or archived to a separate place, usually a low-cost storage medium (e.g. tape drive).

.Why is data purging required?

In an idealistic scenario, we assume the data warehouse stores data for good. However, there are some reasons why this might not be a good idea in a real scenario:

  • There is a cost overhead associated with the amount of data we store. This includes the cost of the storage medium, infrastructure and human resources necessary to manage the data.
  • Data volume has a direct impact on the performance of a data warehouse. More data means more time-consuming sorting and searching operations.
  • End users of the data warehouse on the business side may not be interested in very old facts and figures.
  • Data might lose its importance and relevance with the changing business landscape. Retaining such irrelevant data may not be required.

.Variable purging requirement

  • The concept of data purging may not be applicable for all types of data in the warehouse.
  • Typically one may only want to purge the transactional data and may not want to purge any master data.
  • Similarly, the retention period of the data warehouse may not be the same for all types of data.
  • One may set a variable retention period such that all detail level transactional data will be purged after 2 years and all aggregated transaction data will be purged after 7 years.

.Rolling window for data purging

  • The process of purging is typically implemented as a rolling window, wherein only the data that falls under the purging window is purged.
  • So, if the retention period is set to 2 years and the purging process is executed twice a month, then every 15 days, whenever the purging process runs, it deletes all records older than 2 years as of that day (a sketch is shown below).
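
A minimal sketch of such a rolling-window purge, assuming a hypothetical detail fact table fact_sales_detail with a sales_date column:

-- Delete detail-level facts older than the 2-year retention period
-- as of the day the job runs. Date-interval syntax varies by database.
DELETE FROM fact_sales_detail
WHERE  sales_date < CURRENT_DATE - INTERVAL '2' YEAR;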

.Purging Master / Reference data

Purging of master data is not recommended unless:

  • There is a specific reason why the master data will never be required in the future
  • History is maintained for the change in the master data (e.g. in the form of Slowly Changing Dimension etc.)

If the history of changes in the master data is maintained in the data warehouse, it might be necessary at some point to purge the records corresponding to old histories of the master data. When this is done, one must also remember to clear all the corresponding transactional records pertaining to that master data. To understand the situation, consider this: I have a customer C1 for whom I have two records in my master data table, with surrogate keys K1 and K2. K1 corresponds to customer C1’s attributes as of the year 1996 and K2 corresponds to customer C1’s attributes as of the year 2000.

Surrogate Key  Customer Name  Customer Tier  Year
K1             C1             Gold           1996
K2             C1             Platinum       2000

Now suppose there are 10 transactions pertaining to this customer C1 in the year 1996. In this circumstance, if we decide to purge the master record K1, then all 10 transaction records would become orphans (meaning the key integrity between the master and transaction detail records will break). Ideally one must also purge these 10 transaction records while purging the corresponding master record.

In certain relational databases (e.g. Oracle), this issue can be taken care of automatically if foreign key constraints are defined on the transaction detail table with the “ON DELETE CASCADE” option, as sketched below.
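
A minimal sketch of such a cascading constraint; dim_customer, fact_sales and customer_key are illustrative names only:

-- Define the foreign key on the transaction (detail) table with cascade delete.
ALTER TABLE fact_sales
  ADD CONSTRAINT fk_sales_customer
  FOREIGN KEY (customer_key)
  REFERENCES dim_customer (customer_key)
  ON DELETE CASCADE;

-- Purging the old master record now also removes its dependent transactions.
DELETE FROM dim_customer WHERE customer_key = 'K1';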

.Purging Transactional data

Transactional data purging may be necessary when

  • Corresponding master data records (or parent record) are also purged
  • Data crosses the designated retention period

As discussed above, there can be variable period of retention for the transaction data. Aggregated transaction data may be retained for a longer period of time than the detailed transactions.

.Challenges faced during data warehouse data purging

Time. Purging takes time, and the process of purging may also slow down legitimate daily activities of the data warehouse (e.g. loading may get slower, reports may take more time to refresh, etc.). Therefore one must plan and allocate enough time for the purging activity. Weekends, holidays etc. can be a good time for data purging, as the number of users accessing the data warehouse is typically lower at those times.

On the technical front, one must also consider the post-purging activities that need to be performed in the data warehouse. Since purging frees up a lot of space, additional administrative activities are required to make use of the newly freed-up space. This includes, but is not limited to, activities such as the following (an illustrative sketch follows the list):

  • Tablespace re-organization
  • Index rebuilding
  • Defragmentation / consolidation etc.
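
As an illustration only, typical post-purge maintenance on Oracle might look like the sketch below; the exact commands, and whether they are needed at all, depend on your database, and the object names are hypothetical:

-- Reclaim space in a purged segment and rebuild an affected index (Oracle-style syntax).
ALTER TABLE fact_sales_detail ENABLE ROW MOVEMENT;
ALTER TABLE fact_sales_detail SHRINK SPACE;
ALTER INDEX idx_sales_date REBUILD;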

Data Integration

91.Integrating Heterogeneous Databases

To integrate heterogeneous databases, we have two approaches −

  • Query-driven Approach
  • Update-driven Approach

Query-Driven Approach

This is the traditional approach to integrate heterogeneous databases. This approach was used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of Query-Driven Approach

  • When a query is issued on the client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites involved.
  • Now these queries are mapped and sent to the local query processor.
  • The results from heterogeneous sites are integrated into a global answer set.

Disadvantages

  • Query-driven approach needs complex integration and filtering processes.
  • This approach is very inefficient.
  • It is very expensive for frequent queries.
  • This approach is also very expensive for queries that require aggregations.

Update-Driven Approach

This is an alternative to the traditional approach. Today’s data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages

This approach has the following advantages −

  • This approach provides high performance.
  • The data is copied, processed, integrated, annotated, summarized and restructured in a semantic data store in advance.
  • Query processing does not require an interface to process data at local sources.

92.What is Data Integration (DI)?

Data Integration is the process of combining heterogeneous data sources into a single queryable schema so as to get a unified view of the data.

.Why Data Integration is required?

  • Often large companies and enterprises maintain separate departmental databases to store the data pertaining to the specific department.
  • Although such separation of the data provides better manageability and security, performing any cross-departmental analysis on these datasets becomes very difficult.

For example, if the marketing department and the sales department maintain two secluded databases, then it might not be possible to analyze the effect of a certain advertising campaign by the marketing department on sales of a product. Similarly, if the HR department and the production department maintain their individual databases, it might not be possible to analyze the correlation between yearly incentives and employees’ productivity.

Data integration provides a mechanism to integrate these data from different departments into a single queriable schema.

Below is a list of examples where data integration is required. The list, however, is not comprehensive:

  • Cross functional analysis – as discussed in the above example
  • Finding correlation – Statistical intelligence/scientific application
  • Sharing information – legal or regulatory requirements e.g. sharing customers’ credit information among banks
  • Maintaining a single point of truth – higher management sitting above several departments may need to see a single picture of the business
  • Merger of businesses – after a merger, two companies want to aggregate their individual data assets

.How data integration can be done? – different approaches to data integration

There are mainly 2 major approaches for data integration – commonly known as

  • “tight coupling approach”
  • “loose coupling approach”.

Tight Coupling: Data Warehousing

  • In case of tight coupling approach – which is often implemented through data warehousing, data is pulled over from disparate sources into a single physical location through the process of ETL – Extraction, Transformation and Loading.
  • The single physical location provides a uniform interface for querying the data.
  • ETL layer helps to map the data from the sources so as to provide a semantically uniform data warehouse.
  • This approach is called tight coupling since in this approach the data is tightly coupled with the physical repository at the time of query.

Loose Coupling: Virtual Mediated Schema

  • In contrast to the tight coupling approach, a virtual mediated schema provides an interface that takes the query input from the user, transforms the query into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result.
  • In this approach, the data does not really reside in the mediated schema; it remains only in the actual source databases.
  • However, mediated schema contains several “adapters” or “wrappers” that can connect back to the source systems in order to bring the data to the front end.
  • This approach is often implemented through middleware architecture (EAI).

Figure 1: Simple schematic of data integration

To understand the “Virtual Mediated Schema” approach, consider this example. Let’s say one database stores the weather data of a country for the past 50 years. Another database contains the crop production data of the country for each year. A user might want to query: “In which year was crop production lowest in spite of above-average rainfall?”. The question can be subdivided into 3 different questions as follows:

  1. What is the average rainfall across all the years?
  2. Which are the years where recorded rainfall was higher than the average?
  3. Out of all the years obtained from the above query, which year has least crop production?

The first 2 queries can be sent straight to the weather database and the last query can be sent to the crop production database to get the answers.

In a mediated schema all the data of weather and crop productions will not be available in the schema itself, but the schema will have necessary “adapters” to send the queries back to appropriate databases and combine the result sets.
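
As an illustration only, assuming hypothetical source tables weather(year, rainfall) and crop(year, production), the decomposed queries might look like this:

-- Query 1 (weather database): average rainfall across all years
SELECT AVG(rainfall) FROM weather;

-- Query 2 (weather database): years with above-average rainfall
SELECT year
FROM   weather
WHERE  rainfall > (SELECT AVG(rainfall) FROM weather);

-- Query 3 (crop database): of the qualifying years, the one with the least production.
-- The mediator substitutes the year list returned by query 2, shown here as a literal list.
SELECT year
FROM   crop
WHERE  year IN (1998, 2004, 2011)   -- illustrative result of query 2
ORDER BY production ASC
FETCH FIRST 1 ROW ONLY;             -- some databases use LIMIT 1 instead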

93.Why do we need a Staging Area during ETL Load?

We have a simple data warehouse that takes data from a few RDBMS source systems and loads the data into the dimension and fact tables of the warehouse.

Here is why you often can’t avoid a staging area:

  • Source systems are only available for extraction during a specific time slot, which is generally shorter than your overall data loading time.
  • It’s a good idea to extract and keep things at your end before you lose the connection to the source systems.
  • You want to extract data based on some conditions which require you to join two or more different systems together. E.g. you want to only extract those customers who also exist in some other system. You will not be able to perform a SQL query joining two tables from two physically different databases.
  • Various source systems have different allotted timing for data extraction.
  • Data warehouse’s data loading frequency does not match with the refresh frequencies of the source systems.
  • Extracted data from the same set of source systems are going to be used in multiple places (data warehouse loading, ODS loading, third-party applications etc.)
  • ETL process involves complex data transformations that require extra space to temporarily stage the data
  • There is specific data reconciliation / debugging requirement which warrants the use of staging area for pre, during or post load data validations
  Is there any impact of having a staging area? Yes, there are a few:
    • A staging area increases latency – that is, the time required for a change in the source system to take effect in the data warehouse. In a lot of real-time / near-real-time applications, the staging area is therefore avoided.
    • Data in the staging area occupies extra space.

DATA Load

94.Methods of Incremental Loading in Data Warehouse

  • Incremental loading, a.k.a. delta loading, is a widely used method to load data into data warehouses from the respective source systems.
  • This technique is employed to perform the load faster, in less time, using fewer system resources.

What is Incremental Loading and why is it required?

  • We extract data from one or more source systems and keep storing it in the data warehouse for future analysis.
  • The source systems are generally OLTP systems which store everyday transactional data.

Now when it comes to loading this transactional data into the data warehouse, we have 2 ways to accomplish this: Full Load or Incremental Load.

INCREMENTAL LOAD METHOD FOR LOADING DATA WAREHOUSE

We can make use of the “entry date” field in the customer table and the “sales date” field in the sales table to keep track of this. After each load we “store” the date up to which the loading has been performed in some data warehouse table, and the next day we only extract those records that have a date greater than our stored date. Let’s create a new table to store this date. We will call this table “Batch”.

Batch

Batch_ID  Loaded_Until  Status
1         22-Mar-2012   Success
2         23-Mar-2012   Success

Once we have done this, all we have to do to perform incremental or delta loading is to write our data extraction SQL queries in the format sketched below.
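
A minimal sketch of that extraction pattern, assuming a customer source table with an entry_date column and the Batch table above (names are illustrative):

-- Extract only the records entered after the last successful load.
SELECT t.*
FROM   customer t
WHERE  t.entry_date > (SELECT MAX(b.loaded_until)
                       FROM   batch b
                       WHERE  b.status = 'Success');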

The next day, the subquery “SELECT MAX(b.loaded_until)” (restricted to successful batches) will return 22-Mar-2012. So, in effect, the above query reduces to the form sketched below:
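
A sketch of the effective query for that day’s run (the date-literal syntax varies by database):

SELECT t.*
FROM   customer t
WHERE  t.entry_date > DATE '2012-03-22';   -- i.e. 22-Mar-2012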

Why is the status field created in the Batch table?

This is because it might so happen that the 23rd’s load has failed. So when we start loading again on the 24th, we must take into consideration both the 23rd’s data and the 24th’s data.

Batch_ID  Loaded_Until  Status
1         22-Mar-2012   Success
2         23-Mar-2012   Fail
3         24-Mar-2012   Success

In the above case, the 23rd’s batch load was a failure. That is why, on the next day, we selected all the data after 22-Mar (covering both 23rd and 24th Mar).

 

NORMALIZATION:


95.There are three types of database anomalies:

Update anomaly.

 

The same information can be expressed on multiple rows; therefore updates to the relation may result in logical inconsistencies. For example, each record in an “Employees’ Skills” relation might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee may need to be applied to multiple records (one for each skill). If the update is only partially successful – the employee’s address is updated on some records but not others – then the relation is left in an inconsistent state. Specifically, the relation provides conflicting answers to the question of what this particular employee’s address is. This phenomenon is known as an update anomaly.
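
As a quick illustration (the table and column names employees_skills, employee_id, employee_address and skill are hypothetical), a partial update of a repeated address leaves the relation inconsistent:

-- The address is stored once per skill row, so an address change must touch every row.
-- Updating only one of them produces the update anomaly described above.
UPDATE employees_skills
SET    employee_address = '12 New Street'
WHERE  employee_id = 42
AND    skill = 'SQL';   -- other skill rows for employee 42 still carry the old address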

Insertion anomaly

There are circumstances in which certain facts cannot be recorded at all. For example, each record in a “Faculty and Their Courses” relation might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code. Therefore, we can record the details of any faculty member who teaches at least one course, but we cannot record a newly hired faculty member who has not yet been assigned to teach any courses, except by setting the Course Code to null. This phenomenon is known as an insertion anomaly.

Deletion anomaly

 Under certain circumstances, deletion of data representing certain facts necessitates deletion of data representing completely different facts. The “Faculty and Their Courses” relation described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears, effectively also deleting the faculty member, unless we set the Course Code to null. This phenomenon is known as a deletion anomaly.

96.What is database normalization?

Database normalization is a technique of designing the database in a structured way to reduce redundancy and improve data integrity.

Database normalization is used for the following purposes:

  • To eliminate redundant or useless data
  • To reduce the complexity of the data
  • To ensure consistent relationships between tables as well as the data within them
  • To ensure that data dependencies make sense and the data is logically stored

97.What are different check points for normalizing data?

The following are checkpoints for normalizing data:

  • Arrangement of data into logical groups.
  • Minimizing duplicate data.
  • Organizing the data in such a way that when a modification is needed, it is required in only one place.
  • Enabling users to access and manipulate data quickly and efficiently.
98.Why Denormalization is promoted in Universe Designing?
  •  In a relational data model, for normalization purposes, some lookup tables are not merged as a single table.
  • In a dimensional data modeling (star schema), these tables would be merged as a single table called DIMENSION table for performance and slicing data.
  • Due to this merging of tables into one large dimension table, complex intermediate joins are avoided.
  • Dimension tables are directly joined to fact tables.
  • Though data redundancy occurs in the DIMENSION table, the size of the DIMENSION table is only about 15% of that of the FACT table.
  • That is why denormalization is promoted in universe designing.
  • De-Normalization is a process of adding redundancy to the data.
  • This helps to quickly retrieve the information from the database

.When should you consider denormalization?

Denormalization is used when there is a lot of tables involved in retrieving data. Denormalization is done in dimensional modelling used to construct a data warehouse. This is not usually done for databases of transactional systems.

 

99.What is Normalization, First Normal Form, Second Normal Form, Third Normal Form?
Normalization is the process of assigning attributes to entities. It:
  • Reduces data redundancies
  • Helps eliminate data anomalies
  • Produces controlled redundancies to link tables
Normalization is the analysis of functional dependencies between attributes / data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields / relations.
1NF:
Repeating groups must be eliminated,
Dependencies can be identified,
All key attributes are defined,
No repeating groups in the table
2NF:
The table is already in 1NF,
Includes no partial dependencies –
No attribute is dependent on only a portion of the primary key,
It is still possible to exhibit transitive dependency: attributes may be functionally dependent on non-key attributes
3NF:
The table is already in 2NF,
Contains no transitive dependencies.

 

116.Is this statement TRUE or FALSE: all databases must be in third normal form?

In general, organizational databases are normalized to 3NF in order to remove redundancy and allow efficient access. However, a database can also be created without normalization. Hence it is not mandatory that a database be in third normal form, so the statement is FALSE.

What is first normal form? Explain with an example.

The first normal form is the normal form of a database in which the data must not contain repeating groups. A database is in first normal form if:

1.It contains only atomic values.

Atomic values: a single cell holds only a single value.

2.Each record is unique and there are no repeating groups.

Repeating groups: a repeating group means a table contains two or more closely related values in a column or set of columns.

Example:

Consider the following table, which is not normalized:

Employee Table:

Employee No  Employee Name  Department
1            Amit           OBIEE, ETL
2            Divya          COGNOS
3            Rama           Administrator

To bring it into first normal form, we need to split the table into 2 tables.

First table: Employee table

Employee No  Employee Name
1            Amit
2            Divya
3            Rama

Second Table: Department table

Employee No  Department
1            OBIEE
1            ETL
2            COGNOS
3            Administrator

We have divided the table into two different tables; each column now holds atomic values and the repeating group has been removed.
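
A minimal DDL sketch of this 1NF split (table names, column names and data types are illustrative only):

CREATE TABLE employee (
    employee_no   INTEGER PRIMARY KEY,
    employee_name VARCHAR(100)
);

CREATE TABLE employee_department (
    employee_no INTEGER REFERENCES employee (employee_no),
    department  VARCHAR(100),
    PRIMARY KEY (employee_no, department)   -- each (employee, department) pair is stored once
);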

 

 

What is second normal form?

An entity is in the second normal form if all of its attributes depend on the whole (primary) key. In relational terms, every column in a table must be functionally dependent on the whole primary key of that table. Functional dependency indicates that a link exists between the values in two different columns.

If the value of an attribute depends on a column, the value of the attribute must change if the value in the column changes. The attribute is a function of the column. The following explanations make this more specific:

  • If the table has a one-column primary key, the attribute must depend on that key.
  • If the table has a composite primary key, the attribute must depend on the values in all its columns taken as a whole, not on one or some of them.
  • If the attribute also depends on other columns, they must be columns of a candidate key; that is, columns that are unique in every row.
  • If you do not convert your model to the second normal form, you risk data redundancy and difficulty in changing data. To convert first-normal-form tables to second-normal-form tables, remove columns that are not dependent on the primary key.

 

 Explain Second Normal Form with example

The data is said to be in second normal form if:

1.It is in first normal form.

2.There is no partial dependency of any column on the primary key. That is, if the table has a concatenated (composite) primary key, each non-key attribute in the table must depend on the whole of that key.

3.All non-key attributes are fully functionally dependent on the primary key. If the primary key is not a composite key, this condition is satisfied automatically.

Example:

Let us consider following table which is in first normal form:

Employee No  Department No  Employee Name  Department
1            101            Amit           OBIEE
2            102            Divya          COGNOS
3            101            Rama           OBIEE

In the above example the composite key is {Employee No, Department No}. Employee Name depends only on Employee No, and Department depends only on Department No – both are partial dependencies. We can split the above table into 2 different tables:

Table 1: Employee table

Employee No  Department No  Employee Name
1            101            Amit
2            102            Divya
3            101            Rama

Table 2:Department table

Department No  Department
101            OBIEE
102            COGNOS

Now we have brought the design into second normal form, where each non-key attribute is fully functionally dependent on its table’s primary key.

What is third normal form?

An entity is in the third normal form if it is in the second normal form and none of its attributes is transitively dependent on the primary key. Transitive dependence means that a non-key (descriptor) attribute depends not only on the whole primary key, but also on another non-key attribute that, in turn, depends on the primary key. In SQL terms, the third normal form means that no column within a table is dependent on a descriptor column that, in turn, depends on the primary key.

For 3NF, first, the table must be in 2NF; in addition, we want to make sure that the non-key fields are dependent upon ONLY the PK, and not on other non-key fields for their existence. This is very similar to 2NF, except that now you are comparing the non-key fields to OTHER non-key fields. After all, we know that the relationship to the PK is good, because we established that in 2NF.

 Explain Third Normal Form with example.

The database is in Third normal form if it satisfies following conditions:

1.It is in Second normal form

2.There is no transitive functional dependency

Transitive Dependency:

When column B is functionally dependent on column A, and column C is functionally dependent on column B, then column C is transitively dependent on column A via column B.

Example:

Consider following table:

Employee No  Salary Slip No  Employee Name  Salary
1            0001            Amit           50000
2            0002            Divya          40000
3            0003            Rama           57000

In the above table, Employee No determines Salary Slip No, and Salary Slip No determines Salary. Therefore Employee No determines Salary transitively, via Salary Slip No. We have a transitive functional dependency, so this structure does not satisfy third normal form.

To resolve this we split the table into the following 2 tables:

Employee table:

Employee No  Salary Slip No  Employee Name
1            0001            Amit
2            0002            Divya
3            0003            Rama

Salary Table:

Salary Slip No  Salary
0001            50000
0002            40000
0003            57000

Following are 2 advantages of third normal form:

1.The amount of data duplication is reduced, because transitive dependencies are removed in third normal form.

2.Data integrity is improved.

Explain Boyce-Codd Normal Form with an example.

BCNF is a stricter version of third normal form. It is used to handle anomalies that are not handled by third normal form. In BCNF, the determinant of every non-trivial functional dependency must be a candidate key (superkey); BCNF drops the exemption that third normal form grants to prime (key) attributes.

Third normal form and BCNF can differ only when all of the following conditions are true:

1.The table has 2 or more candidate keys.

2.At least two of the candidate keys are composed of more than 1 attribute.

3.The candidate keys overlap (are not disjoint).

Example:

Address -> {City, Street, Zip}

Key 1 -> {City, Street}

Key 2 -> {Street, Zip}

Functional dependencies:

{City, Street} -> {Zip}

{Zip} -> {City}

There is no non-key attribute (every attribute belongs to some candidate key), hence this relation is in 3NF. However, in the dependency {Zip} -> {City} the determinant Zip is not a candidate key, so the relation violates BCNF. To reach BCNF the relation would have to be decomposed (e.g. into {Zip, City} and {Street, Zip}).

 

 

 

 
