DATAWAREHOUSING

DATAWAREHOUSING   

Data_warehouse

ODSHighLevel

ODS1

ods_dw

343002

DWH_P004_001

1_cuvK-o4OvDEM5iBJPmXb0Q

Comparison

1

6433.SQL vs NoSQL (1)

sqlvnosql

Untitled-picture

0_poJcCxM33T7Ae9tw

700 x 400_NOSQL-2

bigdata-scalein-architecture

Cloud+Offerings–+RDBMS+AND+NoSQL

introduction-to-column-oriented-databases-14-638

nosql

nosql-vs-sql-overview-1

plm-database-tech-spec

1Capture

1.Difference between Inmon and Kimball philosophies of data warehousing?

Both differ in the concept of building the data warehouse.

Inmon explains in creating a data warehouse on a subject-by-subject area basis. Hence, the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as their needs arise. Point-of-sale (POS) data can be added later if management decides it is necessary.

Kimball views data warehousing as a constituency of Data marts. Data marts are focused on delivering business objectives for departments in the organization. And the data warehouse is a conformed dimension of the data marts. Hence, a unified view of the enterprise can be obtained from the dimension modeling on a local departmental level.

• Hence, Kimball–First Data Marts–Combined way —Data warehouse

Inmon—First Data warehouse–Later—-Data marts

2.What is a data warehousing?
  • Data Warehouse is a repository of integrated information, available for queries and analysis.
  • Data and information are extracted from heterogeneous sources as they are generated.
  • This makes it much easier and more efficient to run queries over data that originally came from different sources.
  • Typical relational databases are designed for on-line transactional processing(OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP).
  • As a result, data warehouses are designed differently  than traditional relational databases.
  •  data ware housing is analyzing the business this is timevarient, non volatile, integrated and subject oriented a data ware house is support decision making.
  • A data warehouse is a logical collection of information gathered from many different operational databases used to create business intelligence that supports business analysis activities and decision-making tasks, primarily, a record of an enterprise is past transactional and operational information, stored in a database designed to favor efficient data analysis and reporting (especially OLAP)
  • A Data Warehouse can be defined as a centralized, consistent data store or Decision Support System (OLAP), for the end business users for analysis, prediction and decision making in their business operations. Data from various enterprise-wide application/transactional source systems (OLTP), are extracted, cleansed, integrated, transformed and loaded in the Data Warehouse.

3.Properties of a Data Warehouse

Subject-oriented Data Warehouse is designed based on the major subjects areas of the business, rather than the major application areas of the enterprise. The ultimate necessity is to store decision-support data rather than application-oriented data.

Integrated The data in Data Warehouse comes from different enterprise-wide application source systems. The source data is often inconsistent in nature. During the Integration process, source data must be made consistent in Data Warehouse so that the data becomes homogeneous and uniform.

Time-variant The source data in the Data Warehouse is only accurate and valid at some point in time or over some time interval.

Non-volatile Data in Data Warehouse is not updated or replaced in real time but is refreshed from operational systems on a regular basis. New data is always added incrementally integrating it with the previous data set.

Decision Support System (DSS) is a class of information systems (including but not limited to computerized systems) that support business and organizational decision-making activities. A properly designed DSS is an interactive software-based system intended to help decision makers compile useful information from a combination of raw data, documents, personal knowledge, or business models to identify and solve problems and make decisions.

4.Why we go for Data Warehouse instead of OLTP Reporting and Analysis

  1. OLAP reporting requires historical data whereas the operational databases do not maintain history and updates the transactions.
  2. Data in Data Warehouse is loaded after consolidation, integration, aggregation as well as summarization of data from various heterogeneous OLTP source systems. Hence in that case we will fail to generate OLTP reporting.
  3. Also the data in the different source systems uses inconsistent data representations, codes, and formats which have to be reconciled before loading into Data Warehouse.
  4. Finally, complex OLAP queries will obviously degrade the performance of operational / transactional source systems. The OLTP databases are designed and tuned to support insert and update operations mainly

5.Why Data Warehouse is used?

  • For a long time in the past and also even today, Data warehouses are built to facilitate reporting on different key business processes of an organization, known as KPI.
  • Today we often call this whole process of reporting data from data warehouses as “Data Analytics“.
  • Data warehouses also help to integrate data from different sources and show a single-point-of-truth values about the business measures (e.g. enabling Master Data Management).
  • Data warehouse can be further used for data mining which helps trend prediction, forecasts, pattern recognition etc.
  1. Faster Data Retrieval
  2. Better Understandability
  3. Extensibility

6.Application of data warehousing system?

  • Consistent and Quality of data: Data warehouse system is consistent and cost-effective for various industries for collection their customer data through various resources
  • Cost reduction: A data warehouse system reduce cost by storing all electronically collected data in a data warehouse
  • Accessibility: A data warehouse system data can have easy access to data time to time for business improvement and reporting.

Data warehouse Architecture

As discussed before, a data warehouse helps business executives to

  • organize,
  • analyze, and
  • use their data for decision making.
  • Tuning Production Strategies− The product strategies can be well tuned by repositioning the products and managing the product portfolios by comparing the sales quarterly or yearly.
  • Customer Analysis − Customer analysis is done by analyzing the customer’s buying preferences, buying time, budget cycles, etc.
  • Operations Analysis − Data warehousing also helps in customer relationship management, and making environmental corrections. The information also allows us to analyze business operations.
  • Information Processing − A data warehouse allows to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, reporting using crosstabs, tables, charts, or graphs.
  • Analytical Processing − A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting.
  • Data Mining − Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction. These mining results can be presented using the visualization tools

A data warehouse serves as a sole part of a plan-execute-assess “closed-loop” feedback system for the enterprise management. Data warehouses are widely used in the following fields −

  • Financial services
  • Banking services
  • Consumer goods
  • Retail sectors
  • Controlled manufacturing

7.Types of Data Warehouse Architectures

Single-tier architecture
  • The objective of a single layer is to minimize the amount of data stored by removing data redundancy.
  • It is not frequently used in practice.
Two-tier architecture
  • This architecture separates physically available sources from the data warehouse.
  • This architecture is not expandable & does not support a large number of end-users.
  • Because of network limitations, this architecture faces connectivity issues.
Three-tier architecture
  • It is the most widely used architecture that is consist of the Top, Middle and Bottom Tier.
  • Bottom Tier:  Usually a relational database of the Datawarehouse serves as the bottom tier where Data is cleansed, transformed, and loaded.
  • Middle Tier: This application tier is an OLAP server & presents an abstracted view of the database which acts as a mediator between the end-user and the database.
  • Top-Tier: The top tier is a front-end client layer channels data out of the data warehouse.

8.Functions of Data Warehouse Tools and Utilities

The following are the functions of data warehouse tools and utilities −

  • Data Extraction− Involves gathering data from multiple heterogeneous sources.
  • Data Cleaning− Involves finding and correcting the errors in data.
  • Data Transformation − Involves converting the data from legacy format to warehouse format.
  • Data Loading− Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
  • Refreshing− Involves updating from data sources to warehouse.

Note − Data cleaning and data transformation are important steps in improving the quality of data and data mining results.

9.What are the features of  Warehouse?

  • It is separate from Operational Database.
  • Integrates data from heterogeneous systems.
  • Stores HUGE amount of data, more historical than current data.
  • Does not require data to be highly accurate.
  • Queries are generally complex.
  • Goal is to execute statistical queries and provide results which can influence decision making in favor of the Enterprise.
  • These systems are thus called Online Analytical Processing Systems (OLAP).

10.What are different characteristics of Data Warehouse?

  • Data warehouse is a database which is separate from operational database which stores historical information also.
  • Data warehouse database contains transactional as well as analytical data.
  • Data warehouse helps higher management  to take strategic as well as tactical decisions using historical or current data.
  • Data warehouse helps consolidated historical data analysis.
  • Data warehouse helps business user to see the current trends to run the business.
  • Data warehouse is used for reporting and data analysis purpose.
  •  The part of data can be denormalized so that it can be simplified and improve the performance of the same.
  • A lot of queries are involved where lot of data is also retrieved based on the queries
  •   The data load is controlled
  • Ad hoc queries and planned queries are very common when it comes to data extraction.

11.What is snapshot with reference to data warehouse?

  • Snapshot refers to a complete visualization of data at the time of extraction.
  • It occupies less space and can be used to back up and restore data quickly.
12.What is real time data-warehousing?
 Real-time data warehousing is a combination of two things:
  • real-time activity
    • Real-time activity is activity that is happening right now.
    • The activity could be anything such as the sale of widgets.
    • Once the activity is complete, there is data about it.
  • data warehousing.
    • Data warehousing captures business activity data.
    • Real-time data  warehousing  captures business activity data as it occurs.
    • As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly.
    • In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.
13. What is Virtual Data Warehousing?
  • A virtual data warehouse provides a collective view of the completed data. A virtual data warehouse has no historic data. It can be considered as a logical data model of the containing metadata.
  • Virtual data warehousing is a ‘de facto’ information system strategy for supporting analytical decision making. It is one of the best ways for translating raw data and presenting it in the form that can be used by decision makers. It provides semantic map – which allows the end user for viewing as virtualized.
  •  A virtual data warehouse provides a collective view of the completed data.
  • A virtual data warehouse has no historic data.
  • It can be considered as a logical data model of the containing metadata.
14.What is active data warehousing?
  • An active data warehouse represents a single state of the business. Active data warehousing considers the analytic perspectives of customers and SUPPLIERS. It helps to deliver the updated data through reports.
  • A form of repository of captured transactional data is known as ‘active data warehousing’. Using this concept, trends and patterns are found to be used for future decision making.
  • Active data warehouse has a feature which can integrate the changes of data while scheduled cycles refresh.
  • Enterprises utilize an active data warehouse in drawing the company’s image in statistical manner.

15.What are fundamental stages of Data Warehousing 

Datawarehouse

  • Offline Operational Database
  • Offline Data Warehouse
  • Real Time Datawarehouse
  • Integrated Datawarehouse
Offline Operational Databases – Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system’s performance.
Offline Data Warehouse – Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.
Real Time Data Warehouse – Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
Integrated Data Warehouse – Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

9Capture

16.What are the steps to build the datawarehouse?
  • Gathering business requirements,
  • Identifying Sources
  • Identifying Facts/measures
  • ,Defining Dimensions,
  • Define Attributes,
  • Redefine Dimensions & Attributes Organise
  • Attribute Hierarchy & Define Relationship,
  • Assign Unique Identifiers,
  • Additional conventions: Carnality/Adding ratios.
1 business modeling
2 data modeling
3 data from the source databases
Extraction Transformation Loading
5 DataWare house (Data Marts)

Following are the steps to be followed to build the datawaerhouse:

  • Gathering business requirements
  • Identifying the necessary sources
  • Identifying the facts
  • Defining the dimensions
  • Defining the attributes
  • Redefine the dimensions and attributes if required
  • Organize the Attribute hierarchy
  • Define Relationships
  • Assign unique Identifiers
  • Additional conventions:Cardinality/Adding ratios

18.What are operational DBMS?

They consist of Tables having attributes and are populated by tuples.

  • They generally use the E-R data model.
  •  It is used to store transactional data.
  •  The information content is generally recent.
  •  These are thus called as OLTP systems.
  •  Their goals are data accuracy & consistency , Concurrency , Recoverability, Reliability (ACID Properties).
19.Compare Database & Data Warehouse
Criteria Database Data Warehouse
Type of data Relational or object-oriented data Large volume with multiple data types
Data operations Transaction processing Data modeling and analysis
Dimensions of data Two dimensional Multi-dimensional
Data design ER based and application-oriented Star/Snowflake schema and subject-oriented
Size of data Small ( in GB) Large ( in TB)
Functionality High availability & performance High flexibility and user autonomy
  • The data warehouse supports dimensional modeling which is a design technique to support end-user queries.
20.What are the differences between a RDBMS schema and a data warehouse schema?
RDBMS Schema
  • Used for OLTP systems
  •  Highly Normalized
  • Difficult to understand and navigate
  • Difficult to extract and solve complex problems
DWH Schema
  • Used for OLAP systems
  • De-normalized
  • Easy to understand and navigate
  • Relatively easier in extracting the data and solving complex problems
21.What is the purpose of cluster analysis in Data Warehousing?
  • Cluster analysis is used to define the object without giving the class label.
  • It analyzes all the data that is present in the data warehouse and compare the cluster with the cluster that is already running.
  • It performs the task of assigning some set of objects into the groups also known as clusters.
  • It is used to perform the data mining job using the technique like statistical data analysis. It includes all the information and knowledge around many fields like machine learning, pattern recognition, image analysis and bio-informatics.
  • Cluster analysis performs the iterative process of knowledge discovery and includes trials and failures.
  • It is used with the pre-processing and other parameters as a result to achieve the properties that are desired to be used.

22.In which phase of the Active Data Warehouse evolution do you use data to determine what will happen?

In predicting phase. Analysts utilize the system to leverage information to predict what will happen next in the business to proactively manage the organization’s strategy.This stage requires data mining tools and building predictive models using historical detail. As an example, users can model customer demographics for target marketing.

23.What is the difference between Data warehousing and Business Intelligence
 Data warehousing deals with all aspects of managing the 
  • development,
  • implementation
  • operation of a data warehouse or data mart
    •   including meta data management,
    • data acquisition, data cleansing,
    • data transformation,
    • storage management,
    • data distribution,
    • data archiving,
    • operational reporting,
    • analytical reporting,
    • security management,
    • backup/recovery planning, etc.
Business intelligence, on the other hand, is a set of software tools that enable an organization to
  • analyze measurable aspects of their business such as
  • sales performance, profitability, operational 
  • efficiency,
  • effectiveness of marketing campaigns,
  • market penetration among certain customer groups,
  • cost trends,
  • anomalies and exceptions, etc.
Typically, the term “business intelligence” is used to encompass
  • OLAP,
  • data visualization,
  • data mining 
  • query/reporting tools.
24.Why is chameleon method used in data warehousing?
  • Chameleon is a hierarchical clustering algorithm that overcomes the limitations of the existing models and the methods present in the data warehousing.
  • This method operates on the sparse graph having nodes: that represent the data items, and edges: representing the weights of the data items.
  • This representation allows large dataset to be created and operated successfully. The method finds the clusters that are used in the dataset using two phase algorithm.
    • The first phase consists of the graph partitioning that allows the clustering of the data items into large number of sub-clusters.
    • Second phase uses an agglomerative hierarchical clustering algorithm to search for the clusters that are genuine and can be combined together with the sub-clusters that are produced.
25.What is snapshot with reference to data warehouse?
  • Snapshot refers to a complete visualization of data at the time of extraction. It occupies less space and can be used to back up and restore data quickly.
  • A snapshot is a process of knowing about the activities performed. It is stored in a report format from a specific catalog. The report is generated soon after the catalog is disconnected.

26.What are loops in Datawarehousing?

In datawarehousing, loops are existing between the tables. If there is a loop between the tables, then the query generation will take more time and it creates ambiguity. It is advised to avoid loop between the tables

27.What is included under Data Warehouse and what are the benefits of having it in an organization?

The following activities are involved in Data Warehouse:

  •  Retrieving the data
  • Analyzing the data
  • Extraction of the data, sometimes customized extraction is also applicable
  • Loading of the data
  • Managing and transforming the data

The following are the benefits of the Data Warehouse implementation:

  •  It helps us to maintain a copy of the information from different source transaction systems.
  •  It helps to gather data from different systems into one
  • Helps us in maintaining data history, most of the time this cannot be maintained at source data system
  •  Improves the data quality, as the data is cleansed and transformed so that it matches exactly the way the data is needed
  • Help in data reconstruction so that the raw data that has been gathered can make sense to the relevant users.
  •  More importantly, it saves times
  • Generates High ROI because the data that has been gathered so far is helpful for the business users where they consider the feedback from the data analysis.

28.What is a subject-oriented data warehouse?

Subject-oriented data warehouses are those that store data around a particular “subject” such as customer, sales, product, among others.

29.Explain Data Mart Vs Data Warehouse?

Data warehouse Data mart
Data warehousing is subject oriented, time variant, non-volatile collection of data which is used for creation of transactional reports as well as historical reports. Data Mart is simply a subset of Organization’s Data warehouse
Definition:The Data Warehouse is a large repository of data collected from different organizations or departments within a corporation.

Definition:The data mart is an only subtype of a Data Warehouse. It is designed to meet the need of a certain user group.
Focus:Data warehouse focuses on multiple business areas. Focus:Data mart focuses only on single subject area.
Usage:It helps to take a strategic decision. Usage:The data mart is used to take tactical decisions for growth of business.
Type of system :This is centralized system where one fact is at center surrounded by dimension tables. Type of system :Data mart system is de centralized system
Scope of Business:The Data warehouse will cover all business areas. Scope of Business:The scope of Data mart is within the line of the Business.
Data Model:Data warehouse always follows top-down model Data Model:Data mart always follows bottom-up model.
Data Size:Data warehouse contains all historical data so the database size is large.Approximate size of data warehouse is greater than 100 GB. Data Size:Data mart contains data with only one business area so the size of database is smaller than data warehouse.
Source:Data warehouse data comes from multiple heterogeneous data sources. Source:Data mart data is data of only one business area.Many times it will come from only one data source.
Implementation Time:Data warehouse contains all data which will come from multiple data sources. It will take time to build data warehouse. The Time to build data warehouse is months to years. Implementation Time:Data mart is small data warehouse which will contain the data of only a single business area. The implementation time to build data mart is in months.

30.What is ODS?

  • ODS is abbreviated as Operational Data Store and it is a repository of real time operational data rather than long term trend data
  •  A collection of operation or bases data that is extracted from operation databases and standardized, cleansed, consolidated, transformed,and loaded into an enterprise data architecture.
  • An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse.
  • The ODS may also be used to audit the data warehouse to assure summarized and derived data is calculated properly.
  • The ODS may further become the enterprise shared operational database, allowing operational systems that are being re engineered to use the ODS as there operation databases.
  • An operational data store (“ODS“) is a database designed to integrate data from multiple sources for additional operations on the data.
  • Unlike a master data store, the data is not sent back to operational systems.
  • It may be passed for further operations and to the data warehouse for reporting.
  • In ODS, data can be scrubbed, resolved for redundancy and checked for compliance with the corresponding business rules.
  • This data store can be used for integrating disparate data from multiple sources so that business operations, analysis and reporting can be carried while business operations occur.
  • This is the place where most of the data used in current operation is housed before it’s transferred to the data warehouse for longer term storage or archiving.
  •  An ODS is designed for relatively simple queries on small amounts of data(such as finding the status of a customer order), rather than the complex queries on large amounts of data typical of the data warehouse.
  •  An ODS is similar to your short term memory where it only stores very recent information. On the contrary, the data warehouse is more like long term memory storing relatively permanent information

 31.What is a common data source for the central enterprise data warehouse?

operational data stores (ODS)

32.Comparison between Data Warehouse and Mediated Schema approaches

Tight-Coupling (data warehouse)
Loose Coupling (mediated schema)
Advantages
  1. Independence (Lesser dependency to source systems since data is physically copied over)
  2. Faster query processing
  3. Complex query processing
  4. Advanced data summarization and storage possible
  5. High Volume data processing
Advantages
  1. Data Freshness (low latency – almost realtime)
  2. Higher Agility (when a new source system comes or existing source system changes – only the corresponding adapter is created or changed – largely not affecting the other parts of the system)
  3. Less costlier (Lot of infrastructure cost can be saved since data localization not required)
Disadvantages
  1. Latency (since data needs to be loaded using ETL)
  2. Costlier (data localization, infrastructure, security)
Disadvantages
  1. Semantic conflicts (The meaning of the measure “net profit” can be different in different systems – so semantic awareness is often necessary in the mediated schema*)
  2. Slower query response (due to network / bandwidth issue, non-localization of data / work load on the source system etc.)
  3. High order dependency to the data sources

* Semantic awareness can be achieved by the process of metadata publishing

33.What are different types of Data warehouse Systems?

  • Data Mart
  • Online Analytical Processing (OLAP)
  • Online Transactional Processing
  • Predictive Analysis

Data warehouse Interview Questions

34.What is the definition of Cube in Datawarehousing?

Cubes are logical representation of multidimensional data. The edge of the cube has the dimension members,and the body of the cube contains the data values

35.What is a data warehouse? List the types of Data warehouse architectures.

  • A data warehouse is the electronic storage of an organization’s historical data for the purpose of data analytics.
  •  data warehouse contains a wide variety of data that supports the decision-making process in an organization.

There are mainly 3 types of data warehouse architectures:

36.Define data analytics in the context of data warehousing.

  • Data analytics is the science of examining raw data with the purpose of drawing business-driven conclusions about that data.
  • The role of a data warehouse is to enable data analysis.

37.What is data mart?

  • Data marts are generally designed for a single subject area.
  • An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in data warehouse and each department may have separate data marts.
  • These data marts can be built on top of the data warehouse.
  • A data mart (DM) is a specialized version of a data warehouse (DW)
  • Like data warehouses, data marts contain a snapshot of operational data that helps business people to strategize based on analyses of past trends and experiences.
  • The key difference is that the creation of a data mart is predicated on a specific, predefined need for a certain grouping and configuration of select data.
  • A data mart configuration emphasizes easy access to relevant information 
  • Data Marts are designed to help manager make strategic decisions about their business.
Data Marts are subset of the corporate-wide data that is of value to a specific group of users.
There are two types of Data Marts
  • Independent data marts

 sources from data captured form OLTP system, external providers or from data generated locally within a particular department or geographic area

  • Dependent data mart
sources directly from enterprise data warehouses.

 38.What is data marting? Explain the different kinds of costs associated?

  • The data marting is also called as a “data mart”.
  • A data mart is nothing but a process of redefining information about a specific data set that makes sense for a particular group.

The different kinds of costs associated with data marting are as follows:

  •  Hardware related costs
  •  Software related costs
  • Network access-related costs
  • Time costs

39.Why Do We Need a Data Mart?

Listed below are the reasons to create a data mart −

  • To partition data in order to impose access control strategies.
  • To speed up the queries by reducing the volume of data to be scanned.
  • To segment data into different hardware platforms.
  • To structure data in a form suitable for a user access tool.

40.Critical Features of an ETL framework

In a very broad sense, here are a few of the features that we feel critical in any ETL framework

  • Support for Change Data Capture Or Delta Loading Or Incremental Loading
  • Metadata logging
  • Handling of multiple source formats
  • Restartability support
  • Notification support
  • Highly configurable / customizable

41.Good-to-have features of ETL Framework

These are some good-to-have features for the framework

  • Inbuilt data reconciliation
  • Customizable log format
  • Dry-load enabling
  • Multiple notification formats