ETL

1. Compare ETL Testing & Manual Testing

  • Basic procedure: ETL testing uses scripts to automate the testing process, whereas manual testing relies on observing and checking the data by hand.
  • Requirements: ETL testing needs no additional technical knowledge beyond the testing tool, whereas manual testing requires technical knowledge of SQL and shell scripting.
  • Efficiency: ETL testing is fast, systematic, and gives reliable results, whereas manual testing needs time and effort and is prone to errors.
2. What is ETL?

ETL refers to Extracting, Transforming and Loading data from an outside source system into the required target system. These are the three basic steps in the Data Integration process. Extraction means locating the data and pulling it out of the source system, transformation means converting the extracted data into the format required by the target, and loading means writing the transformed data into the target system in the applicable format.
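As a rough illustration, the three steps can be sketched in SQL. This is only a sketch: the table names (src_orders, stg_orders, dw_orders), the columns and the transformation rules are hypothetical, and real ETL tools perform these steps through their own components rather than hand-written SQL.

-- Extract: pull the needed rows from the source system into a staging table
INSERT INTO stg_orders (order_id, customer_id, order_amount, order_date)
SELECT order_id, customer_id, order_amount, order_date
FROM src_orders
WHERE order_date >= DATE '2017-01-01';

-- Transform and Load: convert the staged data into the target format and write it to the warehouse
INSERT INTO dw_orders (order_id, customer_id, order_amount_usd, order_year)
SELECT order_id,
       customer_id,
       ROUND(order_amount * 0.015, 2),      -- illustrative currency-conversion rule
       EXTRACT(YEAR FROM order_date)        -- derive the reporting year
FROM stg_orders;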

3. Why is ETL testing required?
  • To keep a check on the data being transferred from one system to the other.
  • To keep track of the efficiency and speed of the process.
  • To be well acquainted with the ETL process before it is implemented in your business and production environment.


4. What are an ETL tester's responsibilities?
  • Having in-depth knowledge of the ETL tools and processes.
  • Writing the SQL queries for the various scenarios given during the testing phase.
  • Carrying out different types of tests, such as primary key and default checks, and keeping a check on the other functionality of the ETL process.
  • Performing quality checks on the data.
5. What are the various tools used in ETL?
  • Cognos Decision Stream
  • Oracle Warehouse Builder
  • Business Objects XI
  • SAS Business Warehouse
  • SAS Enterprise ETL Server

8. What are the types of data warehouse applications, and what is the difference between data mining and data warehousing?

The types of data warehouse applications are:

  • Info Processing
  • Analytical Processing
  • Data Mining

Data mining can be defined as the process of extracting hidden predictive information from large databases and interpreting the data, whereas data warehousing may make use of a data mine for faster analytical processing of the data. Data warehousing is the process of aggregating data from multiple sources into one common repository.


9. What is a fact? What are the types of facts?

A fact is the central component of a multi-dimensional model and contains the measures to be analyzed. Facts are related to dimensions.
Types of facts are:

  • Additive Facts
  • Semi-additive Facts
  • Non-additive Facts
10. Explain what Cubes and OLAP Cubes are.

Cubes are data processing units composed of fact tables and dimensions from the data warehouse. They provide multi-dimensional analysis. OLAP stands for Online Analytical Processing, and an OLAP cube stores large amounts of data in multi-dimensional form for reporting purposes. It consists of facts, called measures, categorized by dimensions.

 

11. List a few ETL bugs.
  • Calculation Bug
  • User Interface Bug
  • Source Bug
  • Load Condition Bug
  • ECP related Bug

In addition to the above ETL testing questions, there may be other vital questions where you are asked to mention the ETL tools you have used earlier. You might also be asked about debugging issues you have faced in your earlier experience or about any real-time experience.

7) Explain what tracing level is and what the types are.

Tracing level is the amount of data stored in the log files. Tracing levels can be classified into two types: Normal and Verbose. The Normal level logs detailed information about the transformation, while the Verbose level logs information at each and every row that passes through it.

8) Explain what Grain of Fact is.

Grain of Fact can be defined as the level at which the fact information is stored. It is also known as Fact Granularity.

9) Explain what a factless fact schema is and what Measures are.

A fact table without measures is known as a factless fact table. It can be used to count occurring events. For example, it is used to record an event such as an employee count in a company.

The numeric data based on columns in a fact table is known as Measures.

10) Explain what a transformation is.

A transformation is a repository object which generates, modifies or passes data. Transformations are of two types: Active and Passive.

11) Explain the use of the Lookup Transformation.

The Lookup Transformation is useful for the following (a SQL sketch follows the list):

  • Getting a related value from a table using a column value
  • Updating a slowly changing dimension table
  • Verifying whether records already exist in the table
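In plain SQL, a lookup roughly corresponds to joining the incoming rows against a reference table to fetch the related value, or probing it to see whether a record already exists. This is only a sketch: the tables stg_orders and dim_customer are hypothetical, and in an ETL tool the same work is done by the Lookup Transformation itself rather than a hand-written query.

-- Get a related value (the customer name) for each staged order, looked up by column value
SELECT o.order_id,
       o.customer_id,
       c.customer_name                  -- the looked-up value
FROM stg_orders o
LEFT JOIN dim_customer c
       ON c.customer_id = o.customer_id;

-- Verify whether records already exist in the table (the "existence check" use of a lookup)
SELECT o.order_id
FROM stg_orders o
WHERE NOT EXISTS (SELECT 1 FROM dim_customer c WHERE c.customer_id = o.customer_id);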

12) Explain what partitioning, hash partitioning and round-robin partitioning are.

Partitioning is the subdivision of transactions in order to improve performance. Partitioning enables the Informatica Server to create multiple connections to various sources (a simple sketch of how rows are assigned to partitions follows the list below).

The types of partitions are:

Round-Robin Partitioning:

  • Informatica distributes the data evenly among all partitions.
  • It is applicable when the number of rows to be processed in each partition is approximately the same.

Hash Partitioning:

  • The Informatica Server applies a hash function to the partitioning keys in order to group data among partitions.
  • It is used when you need to ensure that groups of rows with the same partitioning key are processed in the same partition.
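The idea behind the two schemes can be illustrated with a small SQL sketch. It is for illustration only: real partitioning is configured inside the ETL tool, and the customer table, the numeric customer_id key and the choice of four partitions are all hypothetical.

-- Round-robin: rows are dealt out to partitions 0..3 in turn, so partition sizes stay even
SELECT customer_id,
       MOD(ROW_NUMBER() OVER (ORDER BY customer_id), 4) AS partition_no
FROM customer;

-- Hash-style: the partition is derived from the partitioning key itself,
-- so all rows with the same key always land in the same partition
SELECT customer_id,
       MOD(customer_id, 4) AS partition_no
FROM customer;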

13) Mention what the advantage of using the DataReader Destination Adapter is.

The advantage of using the DataReader Destination Adapter is that it populates an ADO recordset (consisting of records and columns) in memory and exposes the data from the DataFlow task by implementing the DataReader interface, so that other applications can consume the data.

14) Using SSIS (SQL Server Integration Services), what are the possible ways to update a table?

The possible ways to update a table using SSIS are:

  • Use a SQL command
  • Use a staging table
  • Use a cache
  • Use the Script Task
  • Use the full database name for the update if MSSQL is used

15) In case you have a non-OLEDB (Object Linking and Embedding Database) source for the lookup, what would you do?

If you have a non-OLEDB source for the lookup, then you have to use a cache to load the data and use it as the source.

16) In what case do you use dynamic cache and static cache in connected and unconnected transformations?

  • A dynamic cache is used when you have to update a master table and slowly changing dimensions (SCD) Type 1.
  • A static cache is used for flat files.

17) Explain the differences between Unconnected and Connected lookup.

  • A Connected lookup participates in the mapping data flow; an Unconnected lookup is called as a lookup function from within an expression transformation.
  • A Connected lookup can return multiple values; an Unconnected lookup returns only one output port.
  • A Connected lookup can be connected to other transformations and returns a value; an Unconnected lookup cannot be connected to another transformation.
  • A Connected lookup can use a static or dynamic cache; an Unconnected lookup has only a static cache.
  • A Connected lookup supports user-defined default values; an Unconnected lookup does not support user-defined default values.
  • A Connected lookup can return multiple columns from the same row or insert into a dynamic lookup cache; an Unconnected lookup designates one return port and returns one column from each row.

18) Explain what a data source view is.

A data source view allows you to define the relational schema which will be used in the Analysis Services databases. Dimensions and cubes are created from data source views rather than directly from data source objects.

19) Explain the difference between OLAP tools and ETL tools.

The difference between ETL and OLAP tools is that:

An ETL tool is meant for extracting data from legacy systems and loading it into a specified database, with some data-cleansing process along the way.

Example: DataStage, Informatica, etc.

An OLAP tool is meant for reporting; in OLAP, the data is available in a multidimensional model.

Example: Business Objects, Cognos, etc.

20) How can you extract SAP data using Informatica?

  • With the PowerConnect option you can extract SAP data using Informatica.
  • Install and configure the PowerConnect tool.
  • Import the source into the Source Analyzer. PowerConnect acts as a gateway between Informatica and SAP. The next step is to generate the ABAP code for the mapping; only then can Informatica pull data from SAP.
  • PowerConnect is used to connect to and import sources from external systems.

21) Mention the difference between PowerMart and PowerCenter.

  • PowerCenter is designed to process huge volumes of data; PowerMart is designed to process low volumes of data.
  • PowerCenter supports ERP sources such as SAP, PeopleSoft, etc.; PowerMart does not support ERP sources.
  • PowerCenter supports local and global repositories; PowerMart supports only a local repository.
  • PowerCenter can convert a local repository into a global repository; PowerMart has no facility to convert a local repository into a global one.

22) Explain what a staging area is and what its purpose is.

Data staging is an area where the data is held temporarily on the data warehouse server. Data staging includes the following steps:

  • Source data extraction and data transformation (restructuring)
  • Data transformation (data cleansing, value transformation)
  • Surrogate key assignments

23) What is a Bus Schema?

A BUS schema is used to identify the common dimensions across the various business processes. It consists of conformed dimensions along with a standardized definition of information.

24) Explain what data purging is.

Data purging is the process of deleting data from a data warehouse. It deletes junk data such as rows with null values or extra spaces.

25) Explain what Schema Objects are.

Schema objects are the logical structures that directly refer to the database's data. Schema objects include tables, views, sequences, synonyms, indexes, clusters, functions, packages and database links.

26) Explain these terms: Session, Worklet, Mapplet and Workflow.

  • Mapplet: It arranges or creates sets of transformations.
  • Worklet: It represents a specific set of tasks.
  • Workflow: It is a set of instructions that tells the server how to execute tasks.
  • Session: It is a set of parameters that tells the server how to move data from sources to targets.

Q #5. What is the Staging area referring to?
Ans. The staging area is the place where the data is stored temporarily during the Data Integration process. Here, the data is cleansed and checked for any duplication.

Q #6. Explain ETL Mapping Sheets.
Ans. ETL mapping sheets contain all the required information from the source file, including all the rows and columns. These sheets help the experts write the SQL queries for testing the ETL tools.

Q #7. Mention a few test cases and explain them.
Ans.

  • Mapping Doc Validation – Verifying that the ETL information is provided in the Mapping Doc.
  • Data Check – Every aspect of the data, such as data checks, number checks and null checks, is tested in this case.
  • Correctness Issues – Misspelled data, inaccurate data and null data are tested.

3. What is the difference between ETL and BI tools?

An ETL tool is used to extract data from different data sources, transform the data, and load it into a DW system. In contrast, a BI tool is used to generate interactive and ad-hoc reports for end-users, dashboards for senior management, and data visualizations for monthly, quarterly, and annual board meetings.

Most common ETL tools include − SAP BO Data Services (BODS), Informatica, Microsoft – SSIS, Oracle Data Integrator ODI, Talend Open Studio, Clover ETL Open source, etc.

Most common BI tools include − SAP Business Objects, SAP Lumira, IBM Cognos, JasperSoft, Microsoft BI Platform, Tableau, Oracle Business Intelligence Enterprise Edition, etc.

4. What are the popular ETL tools available in the market?

The popular ETL tools available in the market are −

  • Informatica − Power Center
  • IBM − Websphere DataStage (Formerly known as Ascential DataStage)
  • SAP − Business Objects Data Services BODS
  • IBM − Cognos Data Manager (Formerly known as Cognos Decision Stream)
  • Microsoft − SQL Server Integration Services SSIS
  • Oracle − Data Integrator ODI (Formerly known as Sunopsis Data Conductor)
  • SAS − Data Integration Studio
  • Oracle − Warehouse Builder
  • ABInitio
  • Open source Clover ETL

 

7. What are the structural differences between an OLTP and OLAP system?

OLTP stands for Online Transactional Processing; such a system is commonly a relational database and is used to manage day-to-day transactions.

OLAP stands for Online Analytical Processing; such a system is commonly a multidimensional system and is also called a data warehouse.

8. What is a Dimension table and how is it different from a Fact table?

Suppose a company sells its products to customers. Every sale is a fact that takes place within the company, and the fact table is used to record these facts. Each fact table stores the keys that join it to the dimension tables, along with the measures/facts.

Example − Fact_Units

Cust_ID   Prod_Id   Time_Id   No. of units sold
101       24        1         25
102       25        2         15
103       26        3         30

A dimension table stores attributes or dimensions that describe the objects in a fact table. It is a set of companion tables to a fact table.

Example − Dim_Customer

Cust_id   Cust_Name   Gender
101       Jason       M
102       Anna        F

9. What is a Data Mart?

A data mart is a simple form of data warehouse and it is focused on a single functional area. It usually gets data only from a few sources.

Example − In an organization, data marts may exist for Finance, Marketing, Human Resources, and other individual departments, each storing data related to its specific functions.

10. What is an Aggregate function? Name a few common aggregate functions.

Aggregate functions are used to group the values of multiple rows of a single column into a more significant measurement. They are also used for performance optimization when we store aggregated tables in the data warehouse.

Common Aggregate functions are −

MIN − returns the smallest value in a given column
MAX − returns the largest value in a given column
SUM − returns the sum of the numeric values in a given column
AVG − returns the average value of a given column
COUNT − returns the total number of values in a given column
COUNT(*) − returns the number of rows in a table

Example

SELECT AVG(salary) 
FROM employee 
WHERE title = 'developer'; 

11. Explain the difference between DDL, DML, and DCL statements.

Data Definition Language (DDL) statements are used to define the database structure or schema.

Examples

  • CREATE − to create objects in a database
  • ALTER − alters the structure of a database

Data Manipulation Language (DML) statements are used to manipulate data within a database.

Examples

  • SELECT − retrieves data from a database
  • INSERT − inserts data into a table
  • UPDATE − updates existing data within a table
  • DELETE − deletes all records from a table; the space allocated for the records remains

Data Control Language (DCL) statements are used to control access to database objects. A short combined example of all three statement classes follows the list below.

Examples

  • GRANT − gives users access privileges to the database
  • REVOKE − withdraws access privileges given with the GRANT command
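The following sketch shows the three statement classes side by side. The employee table definition and the reporting_user account are hypothetical and only serve the illustration.

-- DDL: define the structure
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    VARCHAR(100),
    salary  DECIMAL(10,2),
    title   VARCHAR(50)
);

-- DML: manipulate the data
INSERT INTO employee (emp_id, name, salary, title) VALUES (1, 'Jason', 3000, 'developer');
UPDATE employee SET salary = 3200 WHERE emp_id = 1;

-- DCL: control access
GRANT SELECT ON employee TO reporting_user;
REVOKE SELECT ON employee FROM reporting_user;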

12. What is an Operator in SQL? Explain common operator types.

Operators are used to specify conditions in an SQL statement and to serve as conjunctions for multiple conditions in a statement. The common operator types are −

  • Arithmetic Operators
  • Comparison/Relational Operators
  • Logical Operators
  • Set Operators
  • Operators used to negate conditions

13. What are the common set operators in SQL?

The common set operators in SQL are −

  • UNION
  • UNION ALL
  • INTERSECT
  • MINUS

14. What is the difference between Minus and Intersect? What is their use in ETL testing?

The Intersect operation is used to combine two SELECT statements, but it only returns the records that are common to both SELECT statements. In the case of Intersect, the number of columns and the datatypes must be the same. MySQL does not support the INTERSECT operator. An Intersect query looks as follows −

select * from First 
INTERSECT 
select * from second 

The Minus operation combines the results of two SELECT statements and returns only those results which belong to the first result set. A Minus query looks as follows −

select * from First 
MINUS 
select * from second 

If you perform source minus target and target minus source, and if the minus query returns a value, then it should be considered as a case of mismatching rows.

If the minus query returns a value and the intersect count is less than the source count or the target table count, then the source and target tables contain duplicate rows.

15. Explain ‘Group-by’ and ‘Having’ clause with an example.

The Group-by clause is used with the SELECT statement to collect similar types of data. HAVING is very similar to WHERE, except that the conditions within it are of an aggregate nature. An aggregated version of the example below (using SUM) is sketched after the tables.

Syntax

SELECT dept_no, count ( 1 ) FROM employee GROUP BY dept_no;  
SELECT dept_no, count ( 1 ) FROM employee GROUP BY dept_no HAVING COUNT( 1 ) > 1;

Example − Employee table

Country   Salary
India     3000
US        2500
India     500
US        1500

Group by Country

Country   Salary
India     3000
India     500
US        2500
US        1500
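To show GROUP BY together with an aggregate over the same example table (a sketch; the employee table here is the example table above with Country and Salary columns):

SELECT Country, SUM(Salary) AS total_salary
FROM employee
GROUP BY Country;

-- Result: India 3500, US 4000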

16. What do you understand by ETL Testing?

ETL Testing is done before data is moved into a production data warehouse system. It is sometimes also called Table Balancing or Production Reconciliation.

The main objective of ETL testing is to identify and mitigate data defects and general errors that occur prior to processing of data for analytical reporting.

17. How is ETL Testing different from database testing?

The following points capture the key features of database testing and ETL testing and their comparison −

  • Primary Goal − Database testing: data validation and integration. ETL testing: data extraction, transformation and loading for BI reporting.
  • Applicable System − Database testing: transactional systems where the business flow occurs. ETL testing: systems containing historical data, outside the business-flow environment.
  • Common Tools in the market − Database testing: QTP, Selenium, etc. ETL testing: QuerySurge, Informatica, etc.
  • Business Need − Database testing: used to integrate data from multiple applications; severe impact. ETL testing: used for analytical reporting, information and forecasting.
  • Modeling − Database testing: ER method. ETL testing: multidimensional.
  • Database Type − Database testing: normally used on OLTP systems. ETL testing: applied to OLAP systems.
  • Data Type − Database testing: normalized data with more joins. ETL testing: de-normalized data with fewer joins, more indexes and aggregations.

21. What do you understand by the term ‘transformation’?

A transformation is a set of rules which generates, modifies, or passes data. Transformation can be of two types − Active and Passive.

22. What do you understand by Active and Passive Transformations?

In an active transformation, the number of rows produced as output can change as the transformation is applied. This does not happen in a passive transformation: the data passes through with the same number of rows it received as input.

23. What is Partitioning? Explain different types of partitioning.

Partitioning is dividing the data store area into parts. It is normally done to improve the performance of transactions.

If your DW system is huge in size, it will take time to locate the data. Partitioning the storage space allows you to find and analyze the data more easily and quickly.

Partitioning can be of two types − round-robin partitioning and hash partitioning.

24. What is the difference between round-robin partitioning and Hash partitioning?

In round-robin partitioning, data is distributed evenly among all the partitions, so the number of rows in each partition is roughly the same. In hash partitioning, the server applies a hash function to the partition keys in order to group the data.

25. Explain the terms − mapplet, session, mapping, workflow − in an ETL process?

  • A Mapplet defines the transformation rules.
  • A Session is a set of instructions that tells the server how and when to move data from the source to the target system.
  • A Workflow is a set of instructions that tells the server how to execute the tasks.
  • Mapping is the flow of data from the source to the destination.

26. What is lookup transformation and when is it used?

Lookup transformation allows you to access data from relational tables that are not defined in the mapping document. It is used to update slowly changing dimension tables and to determine whether records already exist in the target.

27. What is a surrogate key in a database?

A surrogate key is a sequence-generated number with no business meaning, used just to identify the row uniquely. It is not visible to users or applications, and it is used in place of a natural (candidate) key.

28. What is the difference between surrogate key and primary key?

A surrogate key has sequence-generated numbers with no business meaning. It is meant only to identify the rows uniquely.

A primary key is also used to identify the rows uniquely, but it is visible to users and can be changed as per requirement.

29. If there are thousands of records in the source system, how do you ensure that all the records are loaded to the target in a timely manner?

In such cases, you can apply the checksum method. Start by checking the number of records in the source and the target systems. Then select the sums of key numeric columns (checksums) in both systems and compare them.
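A minimal sketch of the checksum approach, assuming hypothetical src_sales and tgt_sales tables with an amount column:

-- Row counts should match between source and target
SELECT COUNT(*) AS src_count FROM src_sales;
SELECT COUNT(*) AS tgt_count FROM tgt_sales;

-- Checksums on a key numeric column should also match
SELECT SUM(amount) AS src_checksum FROM src_sales;
SELECT SUM(amount) AS tgt_checksum FROM tgt_sales;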

30. What do you understand by Threshold value validation Testing? Explain with an example.

In this testing, a tester validates the range of data. All the threshold values in the target system are to be checked to ensure they are as per the expected result.

Example − The Age attribute shouldn’t have a value greater than 100. In a date column of format DD/MM/YY, the month field shouldn’t have a value greater than 12. A simple SQL check is sketched below.
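A possible threshold check (a sketch; the target table customer_target and its age and dob_text columns are hypothetical, with dob_text held as text in DD/MM/YY format):

-- Rows violating the age threshold
SELECT * FROM customer_target WHERE age > 100;

-- Rows whose month part falls outside the valid range 01..12
SELECT * FROM customer_target
WHERE CAST(SUBSTR(dob_text, 4, 2) AS INTEGER) NOT BETWEEN 1 AND 12;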

31. Write an SQL statement to perform Duplicate Data check Testing.

Select Cust_Id, Cust_NAME, Quantity, COUNT (*)
   FROM Customer GROUP BY Cust_Id, Cust_NAME, Quantity HAVING COUNT (*) >1;

32. How does duplicate data appear in a target system?

When no primary key is defined, then duplicate values may appear.

Data duplication may also arise due to incorrect mapping, and manual errors while transferring data from source to target system.

33. What is Regression testing?

Regression testing is performed when changes are made to the data transformation and aggregation rules in order to add new functionality; it helps the tester find new errors introduced by those changes. The bugs that appear in the data as a result are called regressions.

34. Name the three approaches that can be followed for system integration.

The three approaches are − top-down, bottom-up, and hybrid.

 

35. What is data purging?

Data purging is a process of deleting data from a data warehouse. It removes junk data like rows with null values or extra spaces.

36. What do you understand by a cosmetic bug in ETL testing?

Cosmetic bug is related to the GUI of an application. It can be related to font style, font size, colors, alignment, spelling mistakes, navigation, etc.

37. What do you call the testing bug that comes while performing threshold validation testing?

It is called Boundary Value Analysis related bug.

38. I have 50 records in my source system but I want to load only 5 records to the target for each run. How can I achieve this?

You can do it by creating a mapping variable and a Filter transformation. You might need to generate a sequence in order to select the specific sorted records you require.

39. Name a few checks that can be performed to achieve ETL Testing Data accuracy.

Value comparison − It involves comparing the data in the source and the target systems with minimum or no transformation. It can be done using various ETL testing tools, such as the Source Qualifier Transformation in Informatica.

Critical data columns can be checked by comparing distinct values in the source and target systems, as sketched below.
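One way to compare the distinct values of a critical column, using the MINUS operator as in the earlier examples (a sketch; src_customer, tgt_customer and the country column are hypothetical):

-- Distinct values present in the source but missing from the target
SELECT DISTINCT country FROM src_customer
MINUS
SELECT DISTINCT country FROM tgt_customer;

-- And the other direction: values present in the target but not in the source
SELECT DISTINCT country FROM tgt_customer
MINUS
SELECT DISTINCT country FROM src_customer;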

40. Which SQL statements can be used to perform Data completeness validation?

You can use Minus and Intersect statements to perform data completeness validation. When you perform source minus target and target minus source and the minus query returns a value, then it is a sign of mismatching rows.

If the minus query returns a value and the intersect count is less than the source count or the target table count, then duplicate rows exist.

41. What is the difference between shortcut and reusable transformation?

A Shortcut Transformation is a reference to an object that is available in a shared folder. These references are commonly used for sources and targets that are to be shared between different projects or environments.

In the Repository Manager, a shortcut is created by assigning the folder 'Shared' status. Objects can later be dragged from this folder into other folders. This gives a single point of control for the object, and individual projects do not have to import all sources and targets into their local folders.

A Reusable Transformation is local to a folder. Example − a reusable sequence generator for allocating warehouse customer IDs. It is useful for loading customer details from multiple source systems while allocating a unique ID to each new source key.

42. What is Self-Join?

When you join a single table to itself, it is called Self-Join.
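For example, a self-join can pair each employee with their manager from the same table (a sketch; the employee table with emp_id, name and manager_id columns is hypothetical):

-- Join the employee table to itself to resolve each employee's manager name
SELECT e.name AS employee_name,
       m.name AS manager_name
FROM employee e
JOIN employee m
  ON e.manager_id = m.emp_id;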

43. What do you understand by Normalization?

Database normalization is the process of organizing the attributes and tables of a relational database to minimize data redundancy.

Normalization involves decomposing a table into less redundant (and smaller) tables but without losing information.

44. What do you understand by fact-less fact table?

A fact-less fact table is a fact table that does not have any measures. It is essentially an intersection of dimensions. There are two types of fact-less tables: One is for capturing an event, and the other is for describing conditions.
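A minimal sketch of a factless fact table that captures an event, consistent with the earlier example of recording an event like an employee count; the table and dimension keys are hypothetical:

-- Records the event "employee was present on a given day"; there are no measures,
-- only keys to the dimensions, so counting rows answers "how many employees were present?"
CREATE TABLE fact_employee_presence (
    date_id     INTEGER NOT NULL,   -- key to the date dimension
    employee_id INTEGER NOT NULL,   -- key to the employee dimension
    dept_id     INTEGER NOT NULL,   -- key to the department dimension
    PRIMARY KEY (date_id, employee_id)
);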

45. What is a slowly changing dimension and what are its types?

Slowly Changing Dimensions refer to the changing value of an attribute over time. SCDs are of three types − Type 1, Type 2, and Type 3.

46. User A is already logged into the application and User B is trying to login, but the system is not allowing. Which type of bug is it?

a − Race Condition bug

b − Calculation bug

c − Hardware bug

d − Load Condition bug

Answer − d

47. Which testing type is used to check the data type and length of attributes in ETL transformation?

a − Production Validation Testing

b − Data Accuracy Testing

c − Metadata Testing

d − Data Transformation testing

Answer − c

48. Which of the following statements is/are not true on the Referential join?

a − It is only used when referential integrity between both tables is guaranteed.

b − It is only used if a filter is set on the right side table

c − It is considered as optimized Inner join.

d − It is only executed when fields from both the tables are requested

Answer − b

49. Which file contains information about configuration of dataset in ETL system?

a − Data File

b − Configuration File

c − Descriptor File

d − Control File

Answer − c

50. Which bug type in ETL testing doesn’t allow you to enter valid values?

a − Load Condition bugs

b − Calculation bugs

c − Race condition bug

d − Input/ Output bug

Answer − d

 

ETL Testing – Scenarios

Q #8. List few ETL bugs.
Ans.  Calculation Bug, User Interface Bug, Source Bugs, Load condition bug, ECP related bug.

2. Explain the 3-layer architecture of an ETL cycle.

The three layers involved in an ETL cycle are −

  • Staging Layer − The staging layer is used to store the data extracted from different source data systems.
  • Data Integration Layer − The integration layer transforms the data from the staging layer and moves the data to a database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of facts and dimensions tables in a DW system is called a schema.
  • Access Layer − The access layer is used by end-users to retrieve the data for analytical reporting.

18. What are the different ETL Testing categories as per their function?

ETL testing can be divided into the following categories based on their function −

  • Source to Target Count Testing − It involves matching the count of records in the source and target systems.
  • Source to Target Data Testing − It involves data validation between the source and target systems. It also covers data integration, threshold value checks and duplicate data checks in the target system.
  • Data Mapping or Transformation Testing − It confirms the mapping of objects in the source and target systems. It also involves checking the functionality of the data in the target system.
  • End-User Testing − It involves generating reports for end-users to verify whether the data in the reports is as expected. It involves finding deviations in reports and cross-checking the data in the target system for report validation.
  • Retesting − It involves fixing the bugs and defects in the data in the target system and running the reports again for data validation.
  • System Integration Testing − It involves testing all the individual systems and later combining the results to find whether there are any deviations.

19. Explain the key challenges that you face while performing ETL Testing.

  • Data loss during the ETL process.
  • Incorrect, incomplete or duplicate data.
  • The DW system contains historical data, so the data volume is very large and genuinely complex, which makes it hard to perform ETL testing on the whole target system.
  • ETL testers are normally not given access to see job schedules in the ETL tool. They hardly have access to BI reporting tools to see the final layout of the reports and the data inside them.
  • It is tough to generate and build test cases because the data volume is too high and complex.
  • ETL testers normally don't have an idea of the end-user report requirements and the business flow of the information.
  • ETL testing involves various complex SQL concepts for data validation in the target system.
  • Sometimes testers are not provided with the source-to-target mapping information.
  • An unstable testing environment results in delays in developing and testing the process.

20. What are your responsibilities as an ETL Tester?

The key responsibilities of an ETL tester include −

  • Verifying the tables in the source system − count check, data type check, keys are not missing, duplicate data check.
  • Applying the transformation logic before loading the data − data threshold validation, surrogate key check, etc.
  • Loading data from the staging area to the target system − aggregate values and calculated measures, key fields are not missing, count check in the target table, BI report validation, etc.
  • Testing of the ETL tool and its components, and test cases − create, design and execute test plans and test cases, test the ETL tool and its functions, test the DW system, etc.
6. Define the ETL testing process.

ETL Testing Process:
Although there are many ETL tools, there is a simple testing process which is commonly used in ETL testing. It is as important as the implementation of the ETL tool in your business. Having a well-defined ETL testing strategy can make the testing process much easier, so this process needs to be followed before you start the Data Integration process with the selected ETL tool. In this ETL testing process, a group of experts comprising the programming and development team will start writing SQL statements, which the development team may customize according to the requirements.
The ETL testing process is:
Analyzing the requirement – understanding the business structure and its particular requirements.
Validation and Test Estimation – an estimation of the time and expertise required to carry out the procedure.
Test Planning and Designing the testing environment – based on the inputs from the estimation, an ETL test environment is planned and worked out.
Test Data Preparation and Execution – data for the test is prepared and executed as per the requirement.
Summary Report – upon completion of the test run, a brief summary report is prepared for improving and concluding.

 

7. Explain what ETL testing operations include.

ETL testing includes:

  • Verifying whether the data is transformed correctly according to the business requirements
  • Verifying that the projected data is loaded into the data warehouse without any truncation or data loss
  • Making sure that the ETL application reports invalid data and replaces it with default values
  • Making sure that data loads within the expected time frame to improve scalability and performance

34. What are the common ETL Testing scenarios?

The most common ETL testing scenarios are −

  • Structure validation
  • Validating Mapping document
  • Validate Constraints
  • Data Consistency check
  • Data Completeness Validation
  • Data Correctness Validation
  • Data Transform validation
  • Data Quality Validation
  • Null Validation
  • Duplicate Validation
  • Date Validation check
  • Full Data Validation using minus query
  • Other Test Scenarios
  • Data Cleaning

ETL Test Scenarios are used to validate an ETL Testing Process. The following table explains some of the most common scenarios and test-cases that are used by ETL testers.

Test Scenarios Test-Cases
Structure Validation It involves validating the source and the target table structure as per the mapping document.

Data type should be validated in the source and the target systems.

The length of data types in the source and the target system should be same.

Data field types and their format should be same in the source and the target system.

Validating the column names in the target system.

Validating Mapping document It involves validating the mapping document to ensure all the information has been provided. The mapping document should have change log, maintain data types, length, transformation rules, etc.
Validate Constraints It involves validating the constraints and ensuring that they are applied on the expected tables.
Data Consistency check It involves checking the misuse of integrity constraints like Foreign Key.

The length and data type of an attribute may vary in different tables, though their definition remains same at the semantic layer.

Data Completeness Validation It involves checking if all the data is loaded to the target system from the source system.

Counting the number of records in the source and the target systems.

Boundary value analysis.

Validating the unique values of primary keys.

Data Correctness Validation It involves validating the values of data in the target system.

Misspelled or inaccurate data is found in table.

Null, Not Unique data is stored when you disable integrity constraint at the time of import.

Data Transform validation It involves creating a spreadsheet of scenarios for input values and expected results and then validating with end-users.

Validating parent-child relationship in the data by creating scenarios.

Using data profiling to compare the range of values in each field.

Validating if the data types in the warehouse are same as mentioned in the data model.

Data Quality Validation It involves performing number check, date check, precision check, data check, Null check, etc.

Example − Date format should be same for all the values.

Null Validation It involves checking the Null values where Not Null is mentioned for that field.
Duplicate Validation It involves validating duplicate values in the target system when data is coming from multiple columns from the source system.

Validating primary keys and other columns if there is any duplicate values as per the business requirement.

Date Validation check Validating date field for various actions performed in ETL process.

Common test-cases to perform Date validation −

  • From_Date should not be greater than To_Date.
  • The format of date values should be proper.
  • Date values should not have any junk values or null values (see the SQL sketch after this table).
Full Data Validation Minus Query It involves validating full data set in the source and the target tables by using minus query.

  • You need to perform both source minus target and target minus source.
  • If the minus query returns a value, that should be considered as mismatching rows.
  • You need to match the rows in source and target using the Intersect statement.
  • The count returned by Intersect should match with the individual counts of source and target tables.
  • If the minus query returns no rows and the count intersect is less than the source count or the target table count, then the table holds duplicate rows.
Other Test Scenarios Other Test scenarios can be to verify that the extraction process did not extract duplicate data from the source system.

The testing team will maintain a list of SQL statements that are run to validate that no duplicate data have been extracted from the source systems.

Data Cleaning Unwanted data should be removed before loading the data to the staging area.
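A possible date-validation query for the test cases listed above (a sketch; the target table booking_target and its from_date and to_date columns are hypothetical):

-- Rows that violate the date rules: missing dates, or From_Date later than To_Date
SELECT *
FROM booking_target
WHERE from_date IS NULL
   OR to_date IS NULL
   OR from_date > to_date;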

 

Question.1   What is a Data Warehouse?

Answer:    A Data Warehouse is a collection of data marts representing historical data from different operational data sources (OLTP). The data from these OLTP systems is structured and optimized for querying and data analysis in a Data Warehouse.

Question.2   What is a Data mart?

Answer:   A Data Mart is a subset of a data warehouse that can provide data for reporting and analysis on a section, unit or department like the Sales Dept, HR Dept, etc. Data Marts are sometimes also called HPQS (Higher Performance Query Structures).

Question.3   What is OLAP?

Answer:      OLAP stands for Online Analytical Processing. It uses database tables (Fact and Dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data.

Question.4   What is OLTP?

Answer:   OLTP stands for Online Transaction Processing. Except for data warehouse databases, the other databases are OLTPs. These OLTP databases use a normalized schema structure and are designed for recording the daily operations and transactions of a business.

Question.5   What are Dimensions?

Answer:   Dimensions are categories by which summarized data can be viewed. For example a profit Fact table can be viewed by a time dimension.

Question.6   What are Conformed Dimensions?

Answer:    Conformed Dimensions are dimensions which are reusable and fixed in nature. Examples: customer, time and geography dimensions.

Question.7   What are Fact Tables?

Answer:    A Fact Table is a table that contains summarized numerical data (facts) and historical data. A Fact Table has a foreign key-primary key relation with the dimension tables. The Fact Table maintains the information in 3rd normal form.


Question.8    What are the types of Facts?

Answer:    The types of Facts are as follows.

1.     Additive Facts: A fact which can be summed up over any of the dimensions available in the fact table.

2.     Semi-Additive Facts: A fact which can be summed up over a few dimensions but not over all the dimensions available in the fact table.

3.     Non-Additive Facts: A fact which cannot be summed up over any of the dimensions available in the fact table.

Question.9   What are the types of Fact Tables?

Answer:    The types of Fact Tables are:

1.     Cumulative Fact Table: This type of fact table generally describes what has happened over a period of time. It contains additive facts.

2.     Snapshot Fact Table: This type of fact table deals with a particular period of time. It contains non-additive and semi-additive facts.

Question.10   What is Grain of Fact?

Answer:   The Grain of Fact is defined as the level at which the fact information is stored in a fact table. This is also called Fact Granularity or Fact Event Level.

Question.11   What is Factless Fact table?

Answer:   A Fact Table which does not contain facts is called a Factless Fact Table. Generally, when we need to combine two data marts, one data mart will have a factless fact table and the other one will have a common fact table.

Question.12   What are Measures?

Answer:    Measures are numeric data based on columns in a fact table.

Question.13   What are Cubes?

Answer:   Cubes are data processing units composed of fact tables and dimensions from the data warehouse. They provide multidimensional analysis.

Question.14   What are Virtual Cubes?

Answer:     These are combinations of one or more real cubes and require no disk space to store them. They store only the definition and not the data.

Question.15   What is a Star schema design?

Answer:     A Star schema is defined as a logical database design in which there will be a centrally located fact table which is surrounded by at least one or more dimension tables. This design is best suited for Data Warehouse or Data Mart.
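A minimal star-schema sketch in SQL DDL, using hypothetical names that echo the Fact_Units example earlier in this document:

-- Dimension tables surround the centrally located fact table
CREATE TABLE dim_customer (cust_id INTEGER PRIMARY KEY, cust_name VARCHAR(100));
CREATE TABLE dim_product  (prod_id INTEGER PRIMARY KEY, prod_name VARCHAR(100));
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, calendar_date DATE);

-- The fact table holds the measure plus the keys to the dimension tables
CREATE TABLE fact_units (
    cust_id    INTEGER REFERENCES dim_customer (cust_id),
    prod_id    INTEGER REFERENCES dim_product (prod_id),
    time_id    INTEGER REFERENCES dim_time (time_id),
    units_sold INTEGER
);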

Question.16   What is Snow Flake schema Design?

Answer:      In a Snowflake design, the dimension table (a de-normalized table in the star schema) is further divided into one or more normalized dimension tables to organize the information in a better structural format. To design a snowflake schema, we should first design the star schema.

Question.17   What is Operational Data Store [ODS] ?

Answer:     It is a collection of integrated databases designed to support operational monitoring. Unlike OLTP databases, the data in an ODS is integrated, subject-oriented and enterprise-wide.

Question.18   What is Denormalization?

Answer:    Denormalization means that a table may carry redundant data, such as multiple duplicate keys, to avoid joins. The dimension tables follow the denormalization method, using the surrogate key technique.

Question.19   What is Surrogate Key?

Answer:    A Surrogate Key is a sequence-generated key which is assigned to act as the primary key in the system (table).

 

What are the validations you perform after the data loads from source to target?

Once the data is loaded into the target tables, we have to do the below validations (a SQL sketch of some of these checks follows the list):

1. Counts between source and Target
2. Duplicates check
3. Null Values check
4. Mandatory Values check
5. Data compare (source minus Target and Target minus source)
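A sketch of the null / mandatory-value checks and the data compare (the tables src_orders and tgt_orders and their columns are hypothetical):

-- Null check / mandatory-value check on key columns in the target
SELECT * FROM tgt_orders
WHERE order_id IS NULL OR customer_id IS NULL;

-- Data compare: source minus target and target minus source should both return no rows
SELECT order_id, customer_id, order_amount FROM src_orders
MINUS
SELECT order_id, customer_id, order_amount FROM tgt_orders;

SELECT order_id, customer_id, order_amount FROM tgt_orders
MINUS
SELECT order_id, customer_id, order_amount FROM src_orders;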

Why is a landing area required in a DWH? What is the difference between a landing area and a staging area in a DWH?

Landing Area or Landing Zone:
This area is used to bring data from multiple sources into a single zone (something like an ODS).
Staging Area:
In this area, logic is applied to the data by creating views.

How to find the number of duplicate records?

Compare the total count of a field with its distinct count; the difference is the number of duplicate values:

SELECT COUNT(Field_Name) - COUNT(DISTINCT Field_Name) FROM Table_Name;

Or list the duplicate rows by grouping on the columns of interest:

SELECT Column_1, Column_2, COUNT(*)
FROM Table_Name
GROUP BY Column_1, Column_2
HAVING COUNT(*) > 1;

What are the validations you perform from the landing area to the staging area?

We check data types, duplication and file types.

Find out the Number of Columns in Flat File

In ETL testing, if the source is a flat file, how will you verify the data count and column count?

Say the file name is abc.txt and it is delimited with a comma (,).

Answer:
head -1 abc.txt | awk -F "," '{print NF}'

This prints the number of fields (columns) in the first line of the file. Using the wc -l command in UNIX, you can check the count of records.

What are the various methods of getting incremental records or delta records from the source systems?

There are three ways to do an incremental load in ETL, as far as I know (a sketch of the third approach is shown below):

1: Delta loading using a parameter file.
2: Delta loading using a control table.
3: Delta loading using a max variable, where the max variable is the last updated date.

Once the source data is loaded into the ODS, we can fetch the records based on a flag such as Insert or Update.
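A sketch of the max-variable (last updated date) approach, assuming a hypothetical src_orders table with a last_updated_date column and an etl_control table that stores the date of the last successful load:

-- Pull only the rows changed since the previous successful load
SELECT *
FROM src_orders
WHERE last_updated_date > (SELECT MAX(last_load_date) FROM etl_control);

-- After a successful load, record the new high-water mark in the control table
INSERT INTO etl_control (last_load_date) VALUES (CURRENT_DATE);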

Validate Source to Target ETL mapping Using Single SQL Query


Test Target Phone Number Format

The source has phone numbers with 8 digits and the target has 10 digits. How do you test that the target phone numbers have 10 digits and start with the 91 (+91) country code? (One possible check is sketched below.)
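One possible check, offered only as a sketch: it assumes hypothetical src_customer and tgt_customer tables with a text phone_number column, and it interprets the requirement as the source's 8 digits being prefixed with the country code 91 to give 10 digits in the target.

-- Target numbers that do not match the expected format (10 digits, starting with 91)
SELECT t.cust_id, t.phone_number
FROM tgt_customer t
WHERE LENGTH(t.phone_number) <> 10
   OR t.phone_number NOT LIKE '91%';

-- Cross-check against the source: the last 8 digits in the target should equal the source number
SELECT s.cust_id, s.phone_number AS src_phone, t.phone_number AS tgt_phone
FROM src_customer s
JOIN tgt_customer t ON t.cust_id = s.cust_id
WHERE SUBSTR(t.phone_number, 3, 8) <> s.phone_number;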


http://www.geekinterview.com/Interview-Questions/Data-Warehouse/ETL/page5

What is a three-tier data warehouse?

Most data warehouses are considered to be three-tier systems. This is essential to their structure. The first layer is where the data lands; this is the collection point where data from outside sources is compiled. The second layer is known as the ‘integration layer,’ where the stored data is transformed to meet company needs. The third layer is called the ‘dimension layer,’ and is where the transformed information is stored for internal use.