DATA LOAD
58.How can we build data marts from FSLDM? Is there any standard approach for this, and does FSLDM support it?
Teradata is like any other DBMS: we can create as many databases as required. Data marts are basically subject oriented and mainly based on the business KPIs that end users generally want to measure. You have to refer to the business requirement document when designing a data mart.
59.What is a pseudo key? What is its use?
If a table in a database either has no natural key or the natural key is so long that using it as the key would be inefficient, then we need to consider using a pseudo key instead. Pseudo keys are usually given the same name as the table they belong to, with _id added to the end of the name. They are usually defined to contain a number and use AUTO_INCREMENT in MySQL (or the equivalent in other SQL dialects) to dynamically allocate a value when new rows are created.
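A minimal sketch of such a pseudo key, assuming MySQL-style AUTO_INCREMENT and a hypothetical customer table:

```sql
-- MySQL-style syntax; other databases use IDENTITY columns or sequences instead.
CREATE TABLE customer (
    customer_id   INT AUTO_INCREMENT PRIMARY KEY,   -- pseudo key
    first_name    VARCHAR(50),
    last_name     VARCHAR(50),
    date_of_birth DATE
);

-- The key value is allocated automatically when a new row is created.
INSERT INTO customer (first_name, last_name, date_of_birth)
VALUES ('Jane', 'Doe', '1980-04-12');
```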
60.What are the languages used in Data cleansing?
R – programming language, SQL – Structured Query Language, advanced Excel macros.
61.How will you do data transformation in data warehouse platform?
Data is transformed using database tools such as MySQL tools or MS Access tools connected to a server such as SQL Server or Oracle Server. After setting up the environment, SQL, with the support of a shell scripting language, is used to transfer the data into the data warehouse system.
62.What are the techniques used in data transformation?
SQL (Structured Query Language) is mainly used for data transformation. The SELECT command of SQL is used together with shell scripting commands such as SCP and SSH to connect to the data warehouse server for data transformation.
63.Define Transformation ?
- Transformation is the core of the reconciliation phase.
- It converts data from its operational source format into a specific data warehouse format.
- If you implement a three-layer architecture, this phase outputs your reconciled data layer.
64.What is meant by Granularity?
- Granularity represents the level of detail stored in a table.
- In BI, granularity is a very important concept when examining table data.
- Granularity can be high or low.
- High-granularity data carries a high level of detail; transaction-level data is high-granularity data.
- Low-granularity data carries only summarized, low-detail information.
- Fact tables are normally defined at the finest level of detail, i.e. we keep very detailed (low-level) data in the fact table.
The following two points are important in defining granularity:
- Determining the dimensions that are to be included
- Determining where to place the hierarchy of each dimension
Real-life example:
Date dimension granularity levels: year, month, quarter, period, week, day
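As a sketch, a hypothetical date dimension carrying those granularity levels as columns:

```sql
CREATE TABLE dim_date (
    date_key     INT PRIMARY KEY,    -- e.g. 20240131
    full_date    DATE,
    day_of_month SMALLINT,
    week_of_year SMALLINT,
    period_name  VARCHAR(20),
    month_number SMALLINT,
    quarter_name CHAR(2),            -- 'Q1' .. 'Q4'
    year_number  SMALLINT
);
```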
65.What are different stages of Business Intelligence?
- Data Source
- Data Analysis
- Decision making support
- Situation Awareness
- Risk Management
66.Why is it Necessary to Partition?
Partitioning is important for the following reasons −
- For easy management,
- To assist backup/recovery,
- To enhance performance.
Horizontal Partitioning
- Partitioning by Time into Equal Segments
- Partition by Time into Different-sized Segments
- Partition on a Different Dimension
- Partition by Size of Table
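A hedged sketch of the first option above, partitioning by time into equal (yearly) segments; PostgreSQL-style declarative partitioning is assumed and the table names are hypothetical:

```sql
-- Oracle and other databases have equivalent PARTITION BY RANGE clauses.
CREATE TABLE fact_sales (
    sale_date   DATE NOT NULL,
    customer_id INT,
    product_id  INT,
    amount      NUMERIC(12,2)
) PARTITION BY RANGE (sale_date);

CREATE TABLE fact_sales_2023 PARTITION OF fact_sales
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

CREATE TABLE fact_sales_2024 PARTITION OF fact_sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
```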
Partitioning Dimensions
If a dimension contains a large number of entries, it may be necessary to partition the dimension. Here we have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may be very large. This would definitely affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow the user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
Vertical Partition
Vertical partitioning splits the data vertically.
Vertical partitioning can be performed in the following two ways −
- Normalization
- Row Splitting
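A hedged sketch of row splitting, one flavour of vertical partitioning, using hypothetical customer tables:

```sql
-- Row splitting: one wide customer table is split into two tables sharing the
-- same primary key, so rarely used columns stay out of the frequently scanned one.
CREATE TABLE customer_core (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    segment     VARCHAR(20)
);

CREATE TABLE customer_extended (
    customer_id           INT PRIMARY KEY REFERENCES customer_core (customer_id),
    postal_address        VARCHAR(200),
    marketing_preferences VARCHAR(200)
);
```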
67.What is ETL?
ETL is abbreviated as Extract, Transform and Load. ETL software reads data from a specified data source and extracts a desired subset of data. Next, it transforms the data using rules and lookup tables to convert it to the desired state.
Then the load function is used to load the resulting data into the target database.
- ETL is abbreviation of extract, transform, and load.
- ETL is software that enables businesses to consolidate their disparate data while moving it from place to place, and it doesn’t really matter that data is in different forms or formats.
- The data can come from any source.
- ETL is powerful enough to handle such data disparities.
- First, the extract function reads data from a specified source database and extracts a desired subset of data.
- Next, the transform function works with the acquired data – using rules or lookup tables, or creating combinations with other data – to convert it to the desired state.
- Finally, the load function is used to write the resulting data to a target database.
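A minimal sketch of the three steps as a single SQL load, assuming hypothetical staging (stg_*), dimension and target tables:

```sql
-- Extract from a staged source table, transform with expressions and a lookup
-- (dimension) table, and load into the target warehouse table, all in one statement.
INSERT INTO dw_sales (sale_date, product_key, amount_usd)
SELECT
    CAST(s.sale_ts AS DATE)   AS sale_date,    -- transform: truncate timestamp to date
    p.product_key             AS product_key,  -- transform: lookup of the surrogate key
    s.amount * x.usd_rate     AS amount_usd    -- transform: currency conversion
FROM stg_sales   s
JOIN dim_product p ON p.product_code  = s.product_code
JOIN stg_fx_rate x ON x.currency_code = s.currency_code;
```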
68.What is slicing-dicing?
- Slicing means showing a slice of the data, given a certain dimension (e.g. Product), member value (e.g. Brown Bread) and measure (e.g. sales).
- Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.
- Slicing and dicing operations are part of pivoting.
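A hedged illustration of slice versus dice as SQL against a hypothetical star schema:

```sql
-- Slice: fix a single dimension member (Product = 'Brown Bread').
SELECT d.year_number, SUM(f.sales_amount) AS sales
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date    d ON d.date_key    = f.date_key
WHERE p.product_name = 'Brown Bread'
GROUP BY d.year_number;

-- Dice: restrict two or more dimensions and view at a different level of aggregation.
SELECT d.quarter_name, s.region_name, SUM(f.sales_amount) AS sales
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date    d ON d.date_key    = f.date_key
JOIN dim_store   s ON s.store_key   = f.store_key
WHERE p.product_name IN ('Brown Bread', 'White Bread')
  AND d.year_number = 2024
GROUP BY d.quarter_name, s.region_name;
```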
69.What is drill-through?
- Drill through is the process of going to the detail level data from summary data.
- The method of navigating from the aggregated data down to its underlying details is called drill-through.
70.What is meant by Data Analytics?
- Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information.
- A data warehouse is often built to enable Data Analytics
71.What is a table (entity)?
Data stored in the form of rows and columns is called a table. Each column has a datatype and, based on the situation, integrity constraints are enforced on columns.
72.What is a column (attribute)?
A column, also known as a field, is a vertical alignment of the data and contains information related to that column.
73.What is a row?
A row, also known as a tuple or record, is a horizontal alignment of the data.
74.What is an artificial (derived) primary key? When should it be used?
Using a name as the primary key violates the principle of stability. The social security number might be a valid choice, but a foreign employee might not have a social security number. This is a case where a derived, rather than a natural, primary key is appropriate. A derived key is an artificial key that you create. A natural key is one that is already part of the database
75.What is a primary key constraint?
Primary key constraint is imposed on the column data to avoid null values and duplicate values. Primary Key=Unique + Not Null. Example: social security number, bank account number, bank routing number
76.What is a composite primary key constraint?
When more than one column is a part of the primary key, it is called as composite primary key constraint.
77.What is a foreign key constraint?
Parent table has primary key and a foreign key constraint is imposed on a column in the child table. The foreign key column value in the child table will always refer to primary key values in the parent table.
78.What is a composite foreign key constraint?
When group of columns are in a foreign key, it is called as composite foreign key constraint
79.What is a constraint? What are the different types of constraint?
Constraint is a rule imposed on the data. The different types of constraints are primary key, unique, not null, foreign key, composite foreign key, check constraint .
80.What is a unique constraint?
A unique constraint is imposed on the column data to avoid duplicate values, but it allows NULL values.
81.What is a check constraint?
A check constraint is used to restrict the range of values allowed in a column.
82.What is a sequence?
A sequence is a database object used to generate unique numbers.
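A hedged DDL sketch tying together the constraint types and the sequence from questions 75-82; all table and column names are hypothetical, and the sequence syntax follows the Oracle/PostgreSQL style:

```sql
CREATE SEQUENCE account_seq START WITH 1 INCREMENT BY 1;
-- Usage varies by dialect: account_seq.NEXTVAL (Oracle), nextval('account_seq') (PostgreSQL).

CREATE TABLE branch (
    branch_id   INT PRIMARY KEY,                  -- primary key: unique + not null
    branch_name VARCHAR(100)
);

CREATE TABLE account (
    account_no BIGINT PRIMARY KEY,                -- primary key constraint
    ssn        CHAR(11) UNIQUE,                   -- unique: no duplicates, NULLs allowed
    branch_id  INT NOT NULL,                      -- not null constraint
    balance    NUMERIC(12,2) CHECK (balance >= 0),         -- check: restricts the range of values
    CONSTRAINT fk_account_branch
        FOREIGN KEY (branch_id) REFERENCES branch (branch_id)  -- foreign key to parent table
);

-- Composite primary key: more than one column makes up the key.
CREATE TABLE account_holder (
    account_no  BIGINT,
    customer_id BIGINT,
    PRIMARY KEY (account_no, customer_id)
);
```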
- Product information
- sales information
- Data validation is to make sure that the loaded data is accurate and meets the business requirements.
- Strategies are the different methods followed to meet the validation requirements.
- The basic purpose of the scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.
86.Can we take a backup when the database is open?
No, we cannot take a full backup when the database is open.
87.What is defined as Partial Backup?
A partial backup in an operating system is a backup short of a full backup, and it can be done while the database is open or shut down.
88.What is the goal of Optimizer?
The goal of the optimizer is to find the most efficient way to execute the SQL statements.
89.What is Execution Plan?
An execution plan is the plan used by the optimizer to select the combination of steps for executing a SQL statement.
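A minimal illustration of requesting an execution plan, assuming Oracle syntax (PostgreSQL and MySQL use EXPLAIN instead); the queried tables are hypothetical:

```sql
EXPLAIN PLAN FOR
SELECT c.customer_name, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
GROUP BY c.customer_name;

-- Display the plan the optimizer chose (join order, access paths, cost).
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```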
90.What is the language that is used for schema definition?
Data Mining Query Language (DMQL) is used for schema definition.
Data warehousing involves data cleaning, data integration and data consolidation.
Data extraction means gathering data from multiple heterogeneous sources
Multidimensional OLAP is faster than Relational OLAP.
Only one dimension is selected for the slice operation.
For dice operation two or more dimensions are selected for a given cube.
DMQL is based on Structured Query Language (SQL).
- A lookup table is one that is used when updating a warehouse.
- When the lookup is placed on the target table (fact table / warehouse) based upon the primary key of the target, it just updates the table by allowing only new records or updated records based on the lookup condition.
- On the fact table it is best to use bitmap indexes.
- Dimension tables can use bitmap indexes and/or other types of indexes: clustered/non-clustered, unique/non-unique.
- A linked cube is one in which a subset of the data can be analysed in great detail.
- The linking ensures that the data in the cubes remain consistent.
- Conformed dimensions
- Standardized definition of facts
The Business Objects repository stores:
- Security information, e.g. user, group, access permission, user type etc.
- Universe information, e.g. objects, classes, table name, column name, relationship etc.
- Document information.
- Informatica
- Data Stage
- Oracle Warehouse Builder
- Ab Initio
- Data Junction
- Hummingbird Genio
- Business Objects Data Integrator
- MS-SQL DTS (Integration Services 2005)
- SQL Loader
- Sunopsis
- Erwin (Computer Associates)
- Embarcadero (Embarcadero Technologies)
- Rational Rose (IBM Corporation)
- PowerDesigner (Sybase Corporation)
- Oracle Designer (Oracle Corporation)
- Informatica (Cubes/Dimensions)
- ERwin (Entity Relationship for Windows)
- MS-Excel
- Business Objects (Crystal Reports)
- Cognos (Impromptu, PowerPlay)
- Microstrategy
- MS Reporting Services
- Informatica Power Analyzer
- Actuate
- Hyperion (BRIO)
- Oracle Express OLAP
- Proclarity
106.What is Data Mining?
- Data mining is a process of extracting hidden trends within a data warehouse.
- For example, an insurance data warehouse can be used to mine data for the most high-risk people to insure in a certain geographical area.
- Data mining is said to be a process of analyzing data in different dimensions or perspectives and summarizing it into useful information; the data can be queried and retrieved from the database in its own format.
107.What is VLDB in the context of data warehousing?
- VLDB is abbreviated as Very Large Database and its size is generally more than one terabyte.
- These are decision support systems used to serve a large number of users.
108.What are the approaches used by Optimizer during execution plan?
There are two approaches:
- Rule Based
- Cost Based
109.What needs to be done while starting the database?
Following need to be done to start the database:
- Start an Instance
- Mount the database
- Open the database
110.What needs to be done when the database is shutdown?
Following needs to be done when the database is shutdown:
- Close the database
- Dismount the database
- Shutdown the Instance
111.What is summary information?
Summary information is the location within the data warehouse where predefined aggregations are stored.
The primary functions of the dimensions are as follows:
- Filtering
- Grouping
- Labelling
112.What is the difference between View and Materialized View?
View:
– Tailored data representation is provided by a view to access data from its table.
– It has a logical structure that does not occupy space.
– Changes made to the underlying tables are reflected through the view.
Materialized view:
– Pre-calculated data persists in a materialized view.
– It occupies physical data space.
– Changes to the underlying tables are not reflected until the materialized view is refreshed.
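A minimal sketch of the two objects against a hypothetical fact_sales/dim_date schema; the CREATE MATERIALIZED VIEW and REFRESH syntax shown is PostgreSQL/Oracle-style:

```sql
-- View: only the query definition is stored; data comes from the base tables at query time.
CREATE VIEW v_monthly_sales AS
SELECT d.year_number, d.month_number, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.year_number, d.month_number;

-- Materialized view: the pre-calculated result set is physically stored and must be
-- refreshed to pick up changes in the base tables.
CREATE MATERIALIZED VIEW mv_monthly_sales AS
SELECT d.year_number, d.month_number, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.year_number, d.month_number;

REFRESH MATERIALIZED VIEW mv_monthly_sales;   -- PostgreSQL; Oracle uses DBMS_MVIEW.REFRESH
```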
113.What are the stages of the data warehouse delivery process?
The stages are:
- IT Strategy
- Education
- Business Case Analysis
- Technical Blueprint
- Build the Version
- History Load
- Ad hoc Query
- Requirement Evolution
- Automation
- Extending Scope
114.Expand what DMQL stands for and what is the language that is used in DMQL?
DMQL stands for Data Mining Query Language.
- This language is used for schema definition.
- The language that is used in DMQL is nothing but SQL language.
- SQL stands for Structured Query Language.
OLAP performs functions such as
- roll-up
- drill-down
- slice
- dice and
- pivot
There are four types of OLAP servers, namely
- Relational OLAP
- Multidimensional OLAP
- Hybrid OLAP, and
- Specialized SQL Servers
The functions performed by Data warehouse tool and utilities are
- Data Extraction
- Data Cleaning
- Data Transformation
- Data Loading and
- Refreshing
Some applications include
- financial services
- banking services
- consumer goods
- retail sectors
- controlled manufacturing
A data Warehouse can implement
- star schema
- snowflake schema and
- fact constellation schema
120.Common aggregate functions include :
- Average (i.e., arithmetic mean)
- Count
- Maximum
- Median
- Minimum
- Mode
- Range
- Sum
121.Purpose of cluster analysis:
- Scalability
- Ability to deal with different kinds of attributes
- Discovery of clusters with arbitrary shape
- High dimensionality
- Ability to deal with noisy data
- Interpretability
122.What is called data cleaning?
- Cleaning of Orphan records
- Data breaching business rules
- Inconsistent data and
- missing information in a database
123.What is the difference between agglomerative and divisive hierarchical clustering?
- Agglomerative hierarchical clustering allows the clusters to be read from bottom to top, so the program always reads the sub-components first and then moves to the parent, whereas divisive hierarchical clustering uses a top-to-bottom approach in which the parent is visited before the children.
- The agglomerative hierarchical method consists of objects in which each object creates its own cluster, and these clusters are grouped together to create a larger cluster. It defines a process of continuous merging until all the single clusters are merged into one complete big cluster consisting of all the objects of the child clusters.
- In divisive clustering, the parent cluster is divided into smaller clusters, and it keeps on dividing until each cluster has a single object to represent.
124.What is a surrogate key?
- A surrogate key is a substitution for the natural primary key.
- It is just a unique identifier or number for each row that can be used as the primary key of the table.
- The only requirement for a surrogate primary key is that it is unique for each row in the table.
- Data warehouses typically use a surrogate key for the dimension tables' primary keys. They can use a sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key.
- It is useful because the natural primary key (i.e. Customer Number in the Customer table) can change, and this makes updates more difficult. Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and, as far as the client is concerned, you may display only the AIRPORT_NAME.
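A hedged sketch of a surrogate key on a dimension table, following the AIRPORT example above; SQL Server-style IDENTITY is shown, whereas Oracle would use a sequence and MySQL AUTO_INCREMENT:

```sql
CREATE TABLE dim_airport (
    airport_id   INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key, internal to the warehouse
    airport_code CHAR(3),                         -- natural/business key from the source system
    airport_name VARCHAR(100),                    -- shown to users; may change over time
    city_name    VARCHAR(100)
);
```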
125.Classifying data for successful modeling
Qualitative Data
Qualitative data are also called categorical data as they represent distinct categories rather than numbers. In dimensional modeling, they are often termed "dimensions". Mathematical operations such as addition or subtraction do not make any sense on such data.
Examples of qualitative data are eye color, ZIP code, phone number etc.
Qualitative data can be further classified into below classes:
NOMINAL :
Nominal data represents data where order of the data does not represent any meaningful information.
Consider your passport number. There is no information as such in whether your passport number is greater or lesser than someone else's passport number. Consider the eye color of people: no matter in which order we represent the eye colors, the order does not matter.
ID, ZIP code, phone number, eye color etc. are examples of the nominal class of qualitative data.
ORDINAL :
Order of the data is important for ordinal data.
Consider the height of people – tall, medium, short. Although these are qualitative, the order of the attributes does matter, in the sense that they represent some comparative information. Similarly, letter grades, a scale of 1-10 etc. are examples of ordinal data.
In the field of dimensional modeling, this kind of data is sometimes referred to as non-additive facts.
Quantitative data
Quantitative data are also called numeric data as they represent numbers. In the dimensional data modeling approach, these data are termed "measures".
Examples of quantitative data are the height of a person, the amount of goods sold, revenue etc.
Quantitative attributes can be further classified as below.
INTERVAL :
The interval classification is used where there is no true zero point in the data and the division operation does not make sense.
Bank balance, temperature on the Celsius scale, GRE score etc. are examples of interval-class data. Dividing one GRE score by another GRE score does not make any sense. In dimensional modeling this is synonymous with semi-additive facts.
RATIO :
The ratio class is applied to data that has a true "zero" and where division does make sense. Consider revenue, length of time etc. These measures are generally additive.
The table below illustrates the different actions that can be applied to the various data types:
| ACTIONS → | Distinct | Order | Addition | Multiplication |
|---|---|---|---|---|
| Nominal | Y | | | |
| Ordinal | Y | Y | | |
| Interval | Y | Y | Y | |
| Ratio | Y | Y | Y | Y |
126.Example of Decision Support System
Typical information that a decision support application might gather and present :
- An inventory of all of your current information assets (including legacy and relational data sources, cubes, data warehouses, and data marts)
- Comparative sales figures between one week and the next
- Projected revenue figures based on new product sales assumptions etc.
127.Components of a Decision Support System
DSS components may be classified as:
- Inputs: Factors, numbers, and characteristics to analyze. This is basically the raw data that you put to the system.
- User Knowledge and Expertise: Inputs requiring manual analysis by the user
- Outputs: Transformed data from which DSS “decisions” are generated
- Decisions: Results generated by the DSS based on user criteria
Detailed study:
https://dwbi.org/data-modelling/dw-design/7-decision-support-system-dss
128.Data Reconciliation
- Many of the data warehouses are built on n-tier architecture with multiple data extraction and data insertion jobs between two consecutive tiers.
- As it happens, the nature of the data changes as it passes from one tier to the next tier.
- Data reconciliation is the method of reconciling, or tying up, the data between any two consecutive tiers (layers).
129.Why Reconciliation is required?
- In the process of extracting data from one source and then transforming the data and loading it to the next layer, the whole nature of the data can change considerably.
- It might also happen that some information is lost while transforming the data.
- A reconciliation process helps to identify such loss of information.
One of the major reasons for information loss is loading failures or errors during loading. Such errors can occur for several reasons, e.g.
- Inconsistent or non-coherent data from the source
- Non-integrating data among different sources
- Unclean / non-profiled data
- Un-handled exceptions
- Constraint violations
- Logical issues / Inherent flaws in program
- Technical failures like loss of connectivity, loss over network, space issue etc.
Failure due to any such issue can result in potential information loss, leading to unreliable data quality for business process decision making.
Furthermore, if such issues are not rectified at the earliest, they become even more costly to "patch" later. Therefore it is highly suggested that a proper data reconciliation process be in place in any data Extraction-Transformation-Load (ETL) process.
130.Methods of Data Reconciliation
Master Data Reconciliation
Master data reconciliation is the method of reconciling only the master data between source and target.
- Master data are generally unchanging or slowly changing in nature and no aggregation operation is done on the dataset.
- That is – the granularity of the data remains same in both source and target.
- That is why master data reconciliation is often relatively easy and quicker to implement.
In one business process, "customer", "products", "employee" etc. are some good examples of master data.
Ensuring that the total number of customers in the source systems matches exactly the total number of customers in the target system is an example of customer master data reconciliation.
Some of the common examples of master data reconciliation measures are:
- Total count of rows, for example:
  - Total customers in source and target
  - Total number of products in source and target etc.
- Total count of rows based on a condition, for example:
  - Total number of active customers
  - Total number of inactive customers etc.
Transactional Data Reconciliation
- Sales quantity, revenue, tax amount, service usage etc. are examples of transactional data.
- Transactional data make the very base of BI reports so any mismatch in transactional data can cause direct impact on the reliability of the report and the whole BI system in general.
- That is why a reconciliation mechanism must be in place in order to detect such a discrepancy beforehand (meaning, before the data reaches the final business users).
- Transactional data reconciliation is always done in terms of total sum.
- This prevents any mismatch otherwise caused due to varying granularity of qualifying dimensions.
- Also this total sum can be done on either full data or only on incremental data set.
Some example measures used for transactional data reconciliation are:
- Sum of total revenue calculated from source and target
- Sum of total product sold calculated from source and target etc.
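Hedged examples of the reconciliation measures above, assuming the source extracts are staged as stg_* tables alongside the warehouse tables (all names hypothetical):

```sql
-- Master data reconciliation: compare row counts between source extract and target dimension.
SELECT
    (SELECT COUNT(*) FROM stg_customer) AS source_customer_count,
    (SELECT COUNT(*) FROM dim_customer) AS target_customer_count;

-- Transactional data reconciliation: compare total sums between source extract and fact table.
SELECT
    (SELECT SUM(amount)       FROM stg_sales)  AS source_revenue,
    (SELECT SUM(sales_amount) FROM fact_sales) AS target_revenue;
```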
Automated Data Reconciliation
- For large warehouse systems, it is often convenient to automate the data reconciliation process by making this an integral part of data loading.
- This can be done by maintaining separate loading metadata tables and populating those tables with reconciliation queries.
- The existing reporting architecture of the warehouse can be then used to generate and publish reconciliation reports at the end of the loading.
- Such automated reconciliation will keep all the stakeholders informed about the trustworthiness of the reports.
131.Scope of Data Reconciliation
- Data reconciliation is often confused with the process of data quality testing.
- Even worse, sometimes the data reconciliation process is used to investigate and pinpoint data issues.
- While data reconciliation may be a part of data quality assurance, these two things are not necessarily the same.
- The scope of data reconciliation should be limited to identifying whether or not there is any issue in the data.
- The scope should not be extended to automate the process of data investigation and pinpointing the issues.
- A successful reconciliation process should only indicate whether or not the data is correct. It will not indicate why the data is not correct. The reconciliation process answers the "what" part of the question, not the "why" part.
132.Top 5 Challenges of Data Warehousing
- Ensuring Acceptable Data Quality
  - Disparate data sources add to data inconsistency
  - Unstabilized source systems
- Ensuring Acceptable Performance
  - Prioritizing performance
  - Setting realistic goals
  - Performance by design
- Testing the Data Warehouse
  - Test planning
  - No automated testing, given the lack of standardization in how the metadata are defined and design approaches are followed in different data warehousing projects
- Reconciliation of Data
  - Reconciliation is a process of ensuring the correctness and consistency of data in a data warehouse. Unlike testing, which is predominantly a part of the software development life cycle, reconciliation is a continuous process that needs to be carried out even after the development cycle is over.
  - Reconciliation is complex
- User Acceptance
DETAILED STUDY :
https://dwbi.org/data-modelling/dw-design/11-top-5-challenges-of-data-warehousing
133.What is volumetric misjudgement?
Even a very carefully done volumetric estimate made in the starting phase of the project can go awry later. This happens for several reasons; e.g., a slight change in the standard business metrics may create a huge impact on the volumetric estimates.
For example, suppose a company has 1 million customers who are expected to grow at a rate of 7% per annum. While calculating the volume and size of your data warehouse you have used this measure in several places. Now if the customer base actually increases by 10% instead of 7%, that would mean 30,000 more customers. In a fact table with the granularity customer, product, day, this would mean 30,000 X 10 X 365 more records (assuming on average one customer uses 10 products). If one record takes 1 KB, then the fact table would now require (30,000 X 365 X 10 X 1 KB) / (1024 X 1024) = 100+ GB more disk space from only one table.
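The same arithmetic expressed as a quick SQL check; all figures are the example's assumptions (add FROM dual in Oracle):

```sql
-- 30,000 extra customers x 10 products x 365 days x 1 KB per record, converted to GB.
SELECT (30000.0 * 10 * 365 * 1) / (1024 * 1024) AS extra_space_gb;   -- ~104 GB
```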
134.What is data retention and Purging?
There are certain requirements to purge / archive / delete the data in a data warehouse after a certain period of time, often termed the retention period of the data warehouse. Once the retention period is reached, data from the data warehouse is purged, deleted, or archived to a separate place, usually comprising a low-cost storage medium (e.g. a tape drive).
135.Why data purging is required?
In an idealistic scenario, we assume the data warehouse stores data for good. However, there are some reasons why this might not be a good idea in a real scenario:
- There is a cost overhead associated with the amount of data that we store. This includes the cost of the storage medium, infrastructure and human resources necessary to manage the data.
- Data volume has a direct impact on the performance of a data warehouse. More data means more time-consuming sorting and searching operations.
- End users of the data warehouse on the business side may not be interested in very old facts and figures.
- Data might lose its importance and relevance with the changing business landscape. Retaining such impertinent data may not be required.
136.Variable purging requirement
- The concept of data purging may not be applicable for all types of data in the warehouse.
- Typically one may only want to purge the transactional data and may not want to purge any master data.
- Similarly the retention period of the data warehouse may not be same for all types of data.
- One may set a variable retention period such that all detail level transactional data will be purged after 2 years and all aggregated transaction data will be purged after 7 years.
137.Rolling window for data purging
- Process of purging is typically implemented as a rolling window wherein whatever data falls under the purging window are only purged.
- So suppose the retention period is set to 2 years and the purging process is executed bi-monthly; then every 15 days, whenever the purging process runs, it deletes all the records older than 2 years as of that day.
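A minimal sketch of such a rolling-window purge, assuming a hypothetical fact_sales_detail table; date arithmetic syntax varies by database:

```sql
-- Remove detail facts older than the 2-year retention period as of the run date.
DELETE FROM fact_sales_detail
WHERE sale_date < CURRENT_DATE - INTERVAL '2' YEAR;
```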
138.Purging Master / Reference data
Purging of master data is not recommended unless:
- There is specific reason why the master data will never be required in the future
- History is maintained for the change in the master data (e.g. in the form of Slowly Changing Dimension etc.)
If the history of changes in the master data is maintained in the data warehouse, it might be necessary at some point to purge the records corresponding to old histories of the master data. When this is done, one must also remember to clear all the corresponding transactional records pertaining to that master data. To understand the situation, consider this: I have a customer C1 for whom I have two records in my master data table, with surrogate keys K1 and K2. K1 corresponds to customer C1's attributes as of the year 1996 and K2 corresponds to customer C1's attributes as of the year 2000.
Surrogate Key | Customer Name | Customer Tier | Year |
---|---|---|---|
K1 | C1 | Gold | 1996 |
K2 | C1 | Platinum | 2000 |
Now suppose there are 10 transactions pertaining to this customer C1 in the year 1996. In this circumstance, if we decide to purge the master record K1, then all the 10 transaction records would become orphan (meaning the key integrity between the master and transaction detail records will break). Ideally one must also purge these 10 transaction records while purging the corresponding master record.
In certain relational databases (e.g. Oracle), this issue can be taken care of automatically if foreign key constraints are enabled in the transaction detail table with the "ON DELETE CASCADE" option.
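A hedged sketch of that cascading setup for the K1/C1 example above; the table and key names are hypothetical:

```sql
-- Master (dimension) table holding the K1/K2-style surrogate keys.
CREATE TABLE dim_customer (
    customer_sk    BIGINT PRIMARY KEY,   -- surrogate key (K1, K2, ...)
    customer_name  VARCHAR(100),
    customer_tier  VARCHAR(20),
    effective_year SMALLINT
);

-- Transaction detail table whose foreign key cascades deletes from the master.
CREATE TABLE fact_transaction (
    transaction_id   BIGINT PRIMARY KEY,
    customer_sk      BIGINT NOT NULL,
    transaction_date DATE,
    amount           NUMERIC(12,2),
    CONSTRAINT fk_txn_customer
        FOREIGN KEY (customer_sk)
        REFERENCES dim_customer (customer_sk)
        ON DELETE CASCADE
);

-- Purging the 1996 version of customer C1 (say its surrogate key value is 1001)
-- automatically removes its dependent transaction records as well.
DELETE FROM dim_customer WHERE customer_sk = 1001;
```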
138.Purging Transactional data
Transactional data purging may be necessary when
- Corresponding master data records (or parent record) are also purged
- Data crosses the designated retention period
As discussed above, there can be variable period of retention for the transaction data. Aggregated transaction data may be retained for a longer period of time than the detailed transactions.
139.Challenges faced during data warehouse data purging
Time. Purging takes time, and the process of purging may also slow down legitimate daily activities of the data warehouse (e.g. loading may get slower, reports may take more time to refresh etc.). Therefore one must plan and allocate enough time for the purging activity. Weekends, holidays etc. can be a good time for data purging, as the number of users accessing the data warehouse is typically lower at those times.
On the technical front, one must also consider the post-purging activities that need to be performed in the data warehouse. Since purging frees up a lot of space, additional administrative activities are required to make use of the newly freed-up space. This includes, but is not limited to, activities such as:
- Tablespace re-organization
- Index rebuilding
- Defragmentation / consolidation etc.
140.What is Data Integration (DI)?
Data integration is the process of combining heterogeneous data sources into a single queryable schema so as to get a unified view of these data.
141.Why Data Integration is required?
- Often large companies and enterprises maintain separate departmental databases to store the data pertaining to the specific department.
- Although such separations of the data provide them better manageability and security, performing any cross departmental analysis on these datasets becomes impossible.
For example, if marketing department and sales department maintain two secluded databases, then it might not be possible to analyze the effect of a certain advertising campaign by the marketing department on sales of a product. Similarly, if HR department and production department maintain their individual databases, it might not be possible to analyze the correlation between yearly incentives and employee’s productivity.
Data integration provides a mechanism to integrate the data from these different departments into a single queryable schema.
Below is a list of examples where data integration is required. The list, however, is not comprehensive
- Cross functional analysis – as discussed in the above example
- Finding correlation – Statistical intelligence/scientific application
- Sharing information – legal or regulatory requirements e.g. sharing customers’ credit information among banks
- Maintaining single point of truth – Higher management topping over several departments may need to see a single picture of the business
- Merger of Business – after merger two companies want to aggregate their individual data assets
142.How data integration can be done? – different approaches to data integration
There are mainly 2 major approaches for data integration – commonly known as
- “tight coupling approach”
- “loose coupling approach”.
Tight Coupling: Data Warehousing
- In case of tight coupling approach – which is often implemented through data warehousing, data is pulled over from disparate sources into a single physical location through the process of ETL – Extraction, Transformation and Loading.
- The single physical location provides a uniform interface for querying the data.
- ETL layer helps to map the data from the sources so as to provide a semantically uniform data warehouse.
- This approach is called tight coupling since in this approach the data is tightly coupled with the physical repository at the time of query.
Loose Coupling: Virtual Mediated Schema
- In contrast to the tight coupling approach, a virtual mediated schema provides an interface that takes the query input from the user, transforms the query into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result.
- In this approach, the data does not really reside in the mediated schema; it remains only in the actual source databases.
- However, the mediated schema contains several "adapters" or "wrappers" that can connect back to the source systems in order to bring the data to the front end.
- This approach is often implemented through middleware architecture (EAI).
To understand the "virtual mediated schema" approach, consider this example. Let's say one database stores the weather data of a country for the past 50 years. Another database contains the crop production data of the country for each year. A user might want to query: "In which year was crop production lowest in spite of more-than-average rainfall?" The question can be subdivided into 3 different questions as follows:
- What is the average rainfall across all the years?
- Which are the years where recorded rainfall was higher than the average?
- Out of all the years obtained from the above query, which year has least crop production?
The first 2 queries can be straightway sent to the weather database and the last query can be sent to the crop production database to get the answers.
In a mediated schema all the data of weather and crop productions will not be available in the schema itself, but the schema will have necessary “adapters” to send the queries back to appropriate databases and combine the result sets.
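A sketch of the three sub-queries the mediator might dispatch, assuming hypothetical annual_weather and annual_crop_production tables:

```sql
-- 1. Average rainfall across all years (sent to the weather database).
SELECT AVG(rainfall_mm) AS avg_rainfall FROM annual_weather;

-- 2. Years where recorded rainfall was higher than that average (weather database).
SELECT year
FROM annual_weather
WHERE rainfall_mm > (SELECT AVG(rainfall_mm) FROM annual_weather);

-- 3. Of the qualifying years, the one with the least crop production
--    (sent to the crop database with the years returned by query 2).
SELECT year
FROM annual_crop_production
WHERE year IN (1998, 2004, 2011)          -- placeholder: years returned by query 2
ORDER BY production_tonnes ASC
FETCH FIRST 1 ROW ONLY;                   -- LIMIT 1 in MySQL/PostgreSQL
```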
143.Why do we need a Staging Area during ETL load?
We have a simple data warehouse that takes data from a few RDBMS source systems and loads the data into the dimension and fact tables of the warehouse.
Here is why you often can't avoid a staging area:
- Source systems are only available for extraction during a specific time slot which is generally lesser than your overall data loading time.
- It’s a good idea to extract and keep things at your end before you lose the connection to the source systems.
- You want to extract data based on some conditions which require you to join two or more different systems together, e.g. you want to extract only those customers who also exist in some other system. You will not be able to perform a SQL query joining two tables from two physically different databases, but you can once both are staged (see the sketch after this list).
- Various source systems have different allotted timing for data extraction.
- Data warehouse’s data loading frequency does not match with the refresh frequencies of the source systems.
- Extracted data from the same set of source systems are going to be used in multiple places (data warehouse loading, ODS loading, third-party applications etc.)
- ETL process involves complex data transformations that require extra space to temporarily stage the data
- There is specific data reconciliation / debugging requirement which warrants the use of staging area for pre, during or post load data validations
Is there any impact of having a staging area? Yes, there are a few:
- The staging area increases latency – that is, the time required for a change in the source system to take effect in the data warehouse. In a lot of real-time / near-real-time applications, the staging area is rather avoided.
- Data in the staging area occupies extra space.
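For the cross-system condition mentioned in the list above, a minimal sketch of the join that becomes possible once both extracts are staged locally (staging table names are hypothetical):

```sql
-- Hypothetical staging tables extracted from two different source systems;
-- once both sit in the same database, the cross-system filter is a plain join.
SELECT c.*
FROM stg_crm_customer c
JOIN stg_billing_customer b
  ON b.customer_code = c.customer_code;
```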
144.Methods of Incremental Loading in Data Warehouse
- Incremental loading, a.k.a. delta loading, is a widely used method to load data into data warehouses from the respective source systems.
- This technique is employed to perform faster load in less time utilizing less system resources.
145.What is Incremental Loading and why is it required
- We extract data from one or more source systems and keep storing it in the data warehouse for future analysis.
- The source systems are generally OLTP systems which store everyday transactional data.
Now when it comes to loading these transactional data into the data warehouse, we have 2 ways to accomplish this: Full Load or Incremental Load.
146.INCREMENTAL LOAD METHOD FOR LOADING DATA WAREHOUSE
We can make use of the "entry date" field in the customer table and the "sales date" field in the sales table to keep track of this. After each load we will store the date up to which the loading has been performed in a data warehouse table, and the next day we only extract those records that have a date greater than our stored date. Let's create a new table to store this date. We will call this table "Batch".
Batch
| Batch_ID | Loaded_Until | Status |
|---|---|---|
| 1 | 22-Mar-2012 | Success |
| 2 | 23-Mar-2012 | Success |
Once we have done this, all we have to do to perform incremental or delta loading is to write our data extraction SQL queries in this format:
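The original snippet is not reproduced in this text; the following is a sketch of the kind of extraction query it refers to, using the example's customer, sales and batch tables (column names are assumed from the description above):

```sql
SELECT c.*
FROM customer c
WHERE c.entry_date > (SELECT MAX(b.loaded_until) FROM batch b);

SELECT s.*
FROM sales s
WHERE s.sales_date > (SELECT MAX(b.loaded_until) FROM batch b);
```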
The next day, the query "SELECT max(b.loaded_until)" will return 22-Mar-2012, so in effect the above queries will reduce to this:
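A sketch of that reduced form, using ANSI date literals:

```sql
SELECT c.* FROM customer c WHERE c.entry_date > DATE '2012-03-22';
SELECT s.* FROM sales s    WHERE s.sales_date > DATE '2012-03-22';
```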
147.Why is the Status field created in the Batch table?
This is because it might so happen that the 23rd load failed. So when we start loading again on the 24th, we must take into consideration both the 23rd's data and the 24th's data.
| Batch_ID | Loaded_Until | Status |
|---|---|---|
| 1 | 22-Mar-2012 | Success |
| 2 | 23-Mar-2012 | Fail |
| 3 | 24-Mar-2012 | Success |
In the above case, the 23rd batch load was a failure. That is why, the next day, we have selected all the data after 22-Mar (including 23rd and 24th Mar).
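A sketch of the extraction query once the Status column is taken into account, using the same assumed names:

```sql
-- Only successful batches count, so after the failed 23-Mar load the next
-- extraction restarts from 22-Mar and picks up both days' data.
SELECT s.*
FROM sales s
WHERE s.sales_date > (SELECT MAX(b.loaded_until)
                      FROM batch b
                      WHERE b.status = 'Success');
```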