TERADATA-2

LOCKS

Performance Tuning

JOIN STRATEGY

TERADATA INDEX

TERADATA JOIN INDEX

 

 

LOCKS

 199.What does” Pseudo Table” Locks mean in EXPLAIN Plan in Teradata?

  • It is a false lock which is applied on the table to prevent  two users from getting conflicting locks with all-AMP requests.
  • PE will determine an particular AMP to manage all AMP LOCK requests for given table and Put Pseudo lock on the table.
  • Pseudo tables are the dummy tables, Whenever we place a request which involves full table scan and needs data to be retrieved from all AMPS then the parsing engine identifies a gatekeeper AMP which can command all other AMP to lock for a particular request from the user and does not allow multiple user to access the same table to prevent global deadlock.

200.How do you whether table is locked or not?

Just run the following query on the table.

Lock Table DBNAME.TABLENAME write nowait

Select * from DBNAME.TABLENAME;

If this table is locked currently then , then the above statement would fail as Nowait will not wait for locks to be released on that  table .

201.How can you apply access lock on table rather than read lock?

We can override default read lock on any table by explicitly mentioning the ACCESS LOCK before using the table.

LOCK TABLE EMPLOYEES FOR ACCESS SELECT * FROM EMPLOYEES;

202.What are different type of locks in teradata ?

In Teradata , LOCKS are used to maintain integrity of data present in the tables. Teradata , itself decides the kind of lock to implement however in some cases users can also decide what kind of lock, the query should use.

In TERADATA, locks can be implemented at three levels:

DATABASE , TABLE and ROW HASH.

And there are 4 types of LOCKS available in TERADATA:

1) ExclusiveSuch LOCKS are applied on a table when it is undergoing any STRUCTURAL change. All DDL commands apply this LOCK. If this lock is applied then no other locks can be applied on that table.

2) WriteSuch LOCKS are applied when any tables data demography changes by either DELETE,INSERT or UPDATE operation on it. So whenever a DML operation is performed on the table , such locks are applied. Only ACCESS locks are permissible when WRITE locks are applied on the table. And ACCESS Locks may lead to STALE DATA READ from the table since data demography would be changed by DML operations holding WRITE LOCKS.

3)READSuch LOCKS are applied when user tries to READ data from the table and don’t want to fetch STALE DATA. READ locks maintains the data integrity as the DATA in the tables cannot be modified while READ LOCK is applied on the tables. READ LOCK only allow READ or ACCESS LOCK on the table. READ Locks are the default lock used by Teradata while fetching data from any Table.

4)ACCESSSuch locks are applied when table needs to be accessed and the data demographics is not that important; even stale data is ok for the process. ACCESS LOCKS allows READ, WRITE or ACCESS Locks to be applied on the table. However EXCLUSIVE locks are not allowed.

Tips It can be considered as a good practice to apply ACCESS locks on the source table when multiple SQL Queries are using same Source Table. Even if the TARGET table is different still it can lead to bad performance as by default TERADATA applied READ lock so other queries may have to wait to change the demography in SOURCE table. Understanding LOCKS and LOCKING table manually can be very useful in an environment where several tables are loaded parallely. Teradata automatically blocks the query trying to apply incompatible LOCK on the table already locked by some other query. You can use below command before your actual DML commands:

LOCKING SOURCE_TABLE FOR ACCESS INSERT INTO TARGET_TABLE SELECT * FROM SOURCE_TABLE; The other good practice is to use PRIMARY INDEX column while fetching data from SOURCE table. In such cases, the query becomes an AMP operation and Teradata applies lock at ROW HASH level and not at Table level.

203.  What is the particular designated level at which a LOCK is liable to be applied in Teradata?

  • Table Level – All the rows that are present inside a table will certainly be locked.
  • Database Level Lock – All the objects that are present inside the database will be locked.
  • Row Hash Level Lock – Only those rows will be locked which are corresponding to the particular row.

Performance Tuning

204. Highlight the need for Performance Tuning.

Performance tuning in Teradata is basically done to identify all the bottlenecks and then resolve them.

205. Comment whether bottleneck is an error or not.

Technically, bottleneck is not a form of error, but it certainly causes a certain amount of delay in the system.

 206.How can bottlenecks be identified?

There are basically four ways of identifying a bottleneck. These are: –

  • Teradata Visual Explain
  • Explain Request Modifier
  • Teradata Manager
  • Performance Monitor

207.What are the Performance improvement techniques available in Teradata?

Performance tuning and optimization of a query involves

  • First of all use EXPLAIN plan to see how the query is performing.
  • Keywords like Product joins, low confidence are measures of poor performance.
  • Make Sure, STATS are collected on the columns used in WHERE Clause and JOIN columns.
  • If STATS are collected , explain plan will show HIGH CONFIDENCE This tells the optimizer about the number of rows in that table which will help the optimizer to choose the redistribution/duplication of smaller tables.
  • Check the joining columns & WHERE Clause whether PI, SI or PPI are used.
  • Check whether proper alias names are used in the joining conditions.
  • Split the queries into smaller subsets in case of poor performance.
  • collecting statistics on join columns
  • avoiding cross product join
  • selection of appropriate primary index (to avoid skewness in storage)
  • and using secondary index.
  • Avoiding NUSI is advisable
  •  Check for the recommendations of stats to be collected for the columns.
  •  Try to avoid casting in join conditions
  •  Use secondary index appropriately.
  • Use join index if necessary
  • choose the Primary Index to avoid skew,
  • avoid using CAST, SUBSTR, CASE, POSITION, as these functions consumes a lot of resource.
  • If two huge tables are joined almost regularly, create Joined Index.
  • Avoid Redistribution when possible
  • Use sub-selects instead of big “IN” lists
  • Use derived tables
  • Use GROUP BY instead of DISTINCT ( GROUP BY sorts the data locally on the VPROC. DISTINCT sorts the data after it is redistributed)
  • Use Compression on large tables
Optimization of queries improves performance of TD and helps the PE to generate the most efficient execution plan.To Optimize a query

208.Below Steps are useful for performance tuning

  • A) Explain the EXPLAIN: Check for EXPLAIN plan to see how exactly Teradata will be executing the query. Try to understand basic keywords in Explain Plan like confidence level, join strategy used, re-distribution happening or not.
  • B) Collect STATS: The stats of the columns used join conditions should updated. Secondary Indexes without proper STATS can be of little or no help. Check for STATS status of the table.
  • C) Use Proper PI: If the Primary index is not properly defined in any one or all of the tables in the query. Check if the PI of target table loaded is unique.
  • D) Use PPI: If there is Partition Primary Index created on a table, try to use it. If you are not using it in filter condition, it will degrade the performance.
  • E) No FUNCTIONS in Conditions: Try to avoid using function in join conditions. Ex Applying COALESCE or TRIM etc causes high CPU consumption.
  • F) Use PPI: If Partition Primary Index is defined in tables try to use it. If you are not using it in filter condition, it will degrade the performance.
  • G) Same column DATA TYPES: Define same data type for the joining columns.
  • H) Avoid IN clause in filter conditions: When there can be huge number of values in where conditions, better option can be to insert such values in a volatile table and use volatile table with INNER JOIN in the main query.
  • I) Use Same PI in Source & Target: PI columns also can help in saving the data into disk .If the Source and Target have the same PI, data dump can happen very efficiently form source to target.
  • J) Collect STATS on VOLATILE table: Collect stats on volatile tables where required can save AMPCPU. Remove stats if already present where it is not getting used. If the volatile table contains UNIQUE PI, then go for sample stats rather than full stats.
  • K) DROPPING volatile tables explicitly: Once volatile tables is no more required you can drop those. Donít wait for complete procedure to be over. This will free some spool space immediately and could prove to be very helpful in avoiding No More Spool Space error.
  • L) NO LOG for volatile tables: Create volatile tables with NO LOG option.
  • M) Check DBQL Stats: Keep your performance stats accessible. Target the most AMPCPU consuming query first.
  • N) UPDATE clause: Do not write UPDATE clause with just SET condition and no WHERE condition. Even if the Target/Source has just one row, add WHERE clause for PI column.
  • O) DELETE & INSERT: Sometimes replacing UPDATE with DELETE & INSERT can save good number of AMPCPU. Check if this holds good for your query.
  • P) Query SPLITS: Split queries into several smaller queries logically and use volatile tables with proper PI.
  • Q) Try MSR: If same target table is loaded multiple times, try MSR for several sections. This will speed the final MERGE step into target table and you may see good CPU gain.
  • R) Try OLAP Functions: Check if replacing co-related sub query with OLAP function may result in AMPCPU saving.
  • S) Avoid DUPLICATE data: If the join columns in the tables involved in the query have duplicates. Use Distinct or Group by, load into a volatile table, collect stats and use the volatile table.
  • T) Use Proper JOINS: If joins used, donít use right outer, left or full joins where inner joins is sufficient.
  • U) User proper ALIAS: Check the aliases in the joins. Small mistake could lead to a product join.
  • V) Avoid CAST: Avoid unnecessary casting for DATE columns. Once defined as DATE, you can compare date columns against each other even when they are in different format. Internally, DATE is stored as INTEGER. CAST is required mainly when you have to compare VARCHAR value as DATE.
  • W) Avoid UDF: Most of the functions are available in Teradata for data manipulations. So avoid User Defined Functions
  • X) Avoid FULL TABLE SCAN: Try to avoid FTS scenarios like SI should be defined on the columns which are used as part of joins or Alternate access path. Collect stats on SI columns else there are chances where optimizer might go for FTS even when SI is defined on that particular column
  • Y) Avoid using IN/NOT IN: For large list of values, avoid using IN /NOT IN in SQLs. Write large list values to a temporary table and use this table in the query
  • Z) Use CONSTANTS: Use constants to specify index column contents whenever possible, instead of specifying the constant once, and joining the tables. This may provide a small savings on performance.
Sql Tuning:
  • Use Top N option and column names to see sample data in a table. If there is a just “select and  * “ then the optimizer has to replace * with all columns from that table Example: Lock Row For Access Select Top 5 Empno, Ename From Employee;
  • Use “Create AS Table” if there is a need to create a table from existing  table. It operates on an efficient block-by-block basis that bypasses journaling. You can create only structure or structure & data or structure,data and stats. Example: CREATE TABLE  NEW_Table AS OLD_TABLE WITH DATA and STAT
  • Use Insert/Select if there is a need to copy data from one table to another empty table. INSERT/SELECT operates on an efficient block-by-block basis that bypasses journaling. Insert/Select will handle conversions also. Example: INSERT into Summary_Table SELECT store, region, sum (sales), count (sale_item) FROM Region_1 GROUP BY 1,2
  • Use Nowait option if you don’t want your request to wait in the queue. Example: Locking Emp For Read Nowait SELECT * FROM Emp
  • Nesting view could add a substantial time to parser.Try to reduce nesting views creation.
  • If values are nearly unique values then “Distinct” clause may outperform “Group By”. When there are many duplicate value then “Group By” performs better than “Distinct”.
  •  Make sure that the join columns have always same data type. Otherwise one of the table rows would have to undergo translation and does full table scan even though there are stats on the join column.
  •  If you are doing a lot of deletes on rows from a table, consider the use of MultiLoad instead of BTEQ / SQL Assistant to do the deletes. MultiLoad completely avoids use of the TJ and is restartable.Another approach is, do Insert/Select into an empty table then drop the original table and then rename the new table to original table.
  •  Check the data distribution of primary index. Example: Select HashAmp (HashBucket (HashRow (Last_Name, Fisrt_name))) , Count (*) From Emp Group By 1 Order By 2 Desc
  •  If possible try to avoid using any functions on join columns or on where columns. For example if substr or coalesce.. Etc. functions used on where column then the optimizer is not going use stats even though if there are stats on the column.
  • If there is sql with an outer join then make sure that the inner table filter condition should be present in on condition then no extra data is loaded into spool file from inner table for further processing. Filter condition for outer table should be present in where condition.

Example: Select A.Empno , A.Ename , B.Dname From Emp A left outer join Dept B On (A.Deptno = B.Deptno And B.Deptno=10 ) Here Emp table called as Outer Table and Dept  table called as Inner table.

  •  You can re-write correlated or minus queries with left outer join. Correlated query has to be executed for every row returned by the outer query. Correlated queries are expensive for the optimizer.

Example:  Get all departments that have no employees.

Select A.Deptno , A.Dname , A.Location From Dept A Where Not Exists (Select ‘1’ ;            From Emp B & nbsp;           Where B.DeptNo = A.Deptno )

Above query can be re-written like below and gives better performance than above.

Select A.Deptno , A.Dname , A.Location From Dept A Left Outer Join Emp B On (A.DeptNo =B.Deptno ) Where B.Deptno Is Null

  •  Avoid use of large list of values in IN/NOT IN clauses. Store them in some temporary  table and use that table in the query.
  • If intermediate tables are used for processing then make sure that it has same PI of source and target tables. Collect the stats on intermediate table NUPI after loading.
  •   If “like” used in a where clause, it is better to try to use one or more leading character in the clause, if at all possible otherwise the optimizer is going to full table scan even though if there is a index on this column.
  •   Use “Not Null” for columns, which are declared as Nullable in table definition so the tables skew can be avoided in joins.
  •  Always check the plan and run the below command to see if optimizer has recommended any stats for better execution. Collect the stats only if the optimizer is recommending high confidence.Avoid the stats if the columns have low cardinality and multi column stats (more than three).Diagnostic help stats on for the session
  • Create join index when the table is consistently joined with other tables other than primary index and retrieval benefits are greater than setting up and maintenance.

JOIN STRATEGY

209.What are the types of Join Strategies available in Teradata?

Join Strategies are used by the optimizer to choose the best plan to join tables based on the given join condition.

  • Merge (Exclusion)
  • Nested
  • Row Hash
  • Product (including Cartesian Product joins)

 

Merge (Exclusion)

  • . It is adopted when the join conditions are based on equality (=).
  • There is a prerequisite though: the two tables must be sorted based on the join column in advance (actually it’s sorted based on the join column row hash sequence).
  • That brings a great advantage for this type of join: both tables only need to be scanned once, in an interleaved manner.
  • Merge join is not necessarily always better than product join, due to the fact that sorting is required.
  • If both tables are huge, sorting can be a tremendous effort.

Requirements:

  • The rows to be joined have to be located on a common AMP
  • Both spools have to be sorted by the ROWID calculated over the join column(s)

Process:

The ROWHASH of each qualifying row in the left spool is used to look up matching rows with identical ROWHASH in the right spool (by means of a binary search as both spools are sorted by ROWID)

Possible Join Preparations required:

  • Re-Distribution of one or both spools by ROWHASH or
  • Duplication of the smaller spool to all AMPs
  • Sorting of one or both spools by the ROWID

teradata merge join

while joining two tables the data will be redistributed or duplicated across all AMPs to make sure joining rows are in the same AMPs.

Relocation of rows to the common AMP can be done by redistribution of the rows by the join column(s) ROWHASH or by copying the smaller table as a whole to all AMPs.

If one table PI is used and Other table PI not used, redistribution/duplication of the table will happen based on the table size.In these cases Secondary Indexes will be helpful.

The common AMP of rows from two spools being joined is defined by the join columns.

 

Case 1 – P.I = P.I joins

  • The Primary Indexes (or any other suitable index) of both tables equals the join columns:
  • there is no redistribution of data over amp’s. Since amp local joins happen as data are present in same AMP and need not be re-distributed.
  • These types of joins on unique primary index are very fast
  • No join preparation is needed as the rows to be joined are already on the common AMP

Case 2 – P.I = non Index joins

  • Only the Primary Index (or any other suitable index)  of one table matches the join columns: The rows of the second table have to be relocated to the common AMP
  • data from second table will be re-distributed on all amps since joins are happening on PI vs. non Index column.
  • Ideal scenario is when small table is redistributed to be joined with large table records on same amp -Data in small table is duplicated to Every AMP where it is joined locally with large table
    • 1- duplicate all rows of one table onto every AMP (The duplication of all rows is done when the non-PI column is on a small table),
    • 2- redistribute the rows of one table by hashing the non-PI join column and sending them to the AMP containing the matching PI row,

Case 3 – non Index = non Index joins 

  • Neither the Primary Index of the first table (or any other suitable index) nor the Primary Index (or any other suitable index)  of the second table matches the join columns:
  • data from both the tables are redistributed on all AMPs.
  • This is one of the longest processing queries; Care should be taken to see that stats are collected on these columns.
  • redistribute both tables by hashed join column value

Nested Join

  • Nested Join is the most efficient join method in Teradata.
  • It is also the only join method that don’t always use all the AMPs.

In order to make Nested Join picked, the following conditions must be satisfied:

1) The join condition is based on equality;

2) The join column is a unique index on one table;

3) The join column is any index on another table.

  • First only one single row will be retrieved from one table with the help of the unique index, and then based on the row hash of that row, another table is accessed by some index.
  • Nested Join is one of the most precise join plans   suggested by Optimizer .
  • Nested Join works on UPI/USI used in Join statement and is used to retrieve the single row from first table .
  • It then checks for one more matching rows in second table based on being used in the join using an index (primary or secondary) and returns the matching results.

 

Requirements:

  • Spool 1 allows a unique ROWHASH access (a unique index is defined)
  • Spool 2 allows any kind of ROWHASH access (a unique or not unique is index defined)

Process:

  • The qualifying row of spool 1 is accessed by usage of any unique index.
  • The row is relocated to the AMP owning the rows of spool 2
  • Spool 2 is full table scanned and each row is combined with the one row from Spool 1

Possible Join Preparations required:

  • None

Example:

Select EMP.Ename , DEP.Deptno, EMP.salary

from

EMPLOYEE EMP ,

DEPARTMENT DEP

Where EMP.Enum = DEP.Enum

and EMp.Enum= 2345; this results in nested join

 Hash join

  • Hash Join is also based on equality condition (=).
  • Hash Join Hash Join gets its name from the fact that one smaller table is built as “hash-table”, and potential matching rows from the second table are searched by hashing against the smaller table.
  • Usually optimizer will first identify a smaller table, and then sort it by the join column row hash sequence.
  • If the smaller table is really small and can fit in the memory, the performance will be best. Otherwise, the sorted smaller table will be duplicated to all the AMPs.
  • Then the larger table is processed one row at a time by doing a binary search of the smaller table for a match.
  •  We can say Hash Join to be close relative of Merge based on its functionality.

In case of merge join, joining would happen in same amp.

  •   In Hash Join, one or both tables which are on same amp are fit completely inside the AMP’s Memory   .
  • Amp chooses to hold small tables in its memory for joins happening on ROW hash.
  • The Sprinter, but only if executed in FSG Cache

Requirements:

  • The rows to be joined have to be located on a common AMP
  • The smaller spool is sorted by the ROWHASH calculated over the join column(s) and kept in the FSG cache
  • The bigger spool stays unsorted

Process:

  • The bigger spool is full table scanned row by row
  • Each ROWID from the bigger spools is searched in the smaller spool (by means of a binary search)

Possible Join Preparations required:

  • Re-Distribution of the smaller spool by ROWHASH or
  • Duplication of the smaller spool to all AMPs
  • Sorting of the smaller spools

hash_join

Advantages of Hash joins are

  1. They are faster than Merge joins since the large table doesn’t need to be sorted.
  2. Since the join happening b/w table in AMP memory and table in unsorted spool, it happens so quickly.

 Exclusion Join

 

  •  Exclusion Join This join strategy is used to find non-matching rows.
  • If the query contains “NOT IN” or “EXCEPT”, exclusion join will be picked.
  • As a matter of fact, this kind of join can be done as either Merge Join or Product Join.
  • One thing worth noticing: exclusion merge join is based on set subtraction operation, and a three-value logic (TRUE, FALSE, UNKNOWN) will be used when comparisons is done on nullble columns (or temporary result set).

These type of joins are suggested by optimizer when following are used in the queries

  • – NOT IN
  • – EXCEPT
  • – MINUS
  • – SET subtraction operations

Select

EMP.Ename , DEP.Deptno, EMP.salary

from

EMPLOYEE EMP

WHERE

EMP.Enum

NOTIN

( Select Enum from

DEPARTMENT DEP

where Enum isNOTNULL );

Please make sure to add an additional WHERE filter “with <column> IS NOT NULL since usage of NULL in a NOT IN <column> list will return no results.

Exclusion join for following NOT In query has 3 scenarios

Case 1: matched data in “NOT IN” sub Query will disqualify that row

Case 2: Non-matched data in “NOT IN” sub Query will qualify that row

Case 3: Any Unknown result in “NOT IN” will disqualify that row – (‘NULL’ is a typical example of this scenario).

Product join

  • to find a match between two tables with a join condition which is not based on equality (>, <, <>), or join conditions are ORed together.
  • The reason why we call it “Product” join is that, the number of comparisons required is the “product” of the number of rows of both tables. For example, table t1 has 10 rows, and table t2 has 25 rows, then it would require 10×25=250 comparisons to find the matching rows.
  • When the WHERE clause is missing, it will cause a special product join, called Cartesian Join or Cross Join, 

Requirements:

  • The rows to be joined have to be located on the AMP
  • No spool needs to be sorted!

Process:

  • A full table scan is done on the smaller spool and
  • Each qualifying row of spool 1 is compared against each row of spool 2

Possible Join Preparations required:

  • Re-Distribution of one or both spools by ROWHASH or
  • Duplication of the smaller spool

product_join

 

 

Sliding Window Merge Join

The Teradata Traditional Merge Join

Requirements:

  • The rows to be joined have to be located on a common AMP
  • Both spools have to be sorted by the ROWID calculated over the join column(s)

Possible Join Preparations required:

  • Re-Distribution of one or both spools by ROWHASH or
  • Duplication of the smaller spool to all AMPs
  • Sorting of one or both spools by the ROWID

Two different algorithms can be used for the Merge Join:

1. The Fast Path Algorithm

The comparision takes place alternating from both sides starting wit the left table. The algorithm tries to join rows with matching rowhash. If there is no match, the pointer is positioned on the row with the next highest rowhash value and the comparison continues until all rows have been compared.

This method is used if left and right table are full table scanned:

teradata fastpath merge join

2. The Slow Path Algorithm

This algorithm reads each row from the left table and tries to match it against rows with the same rowhash from the right table.

This method is used if the left table is accessed via an index:

teradata slow path merge join

 

Traditional Merge Joins have a big advantage over other join types which don’t require both tables to be sorted by rowhash of the join columns:

Each data block of both tables is accessed exactly once (the algorithm basically slides down on each of the rowhash sorted tables) and it is therefore less sensitive to the size of available FSG cache.

The Teradata Sliding Window Merge Join

The sliding window merge join can be considered as an advancement of the traditional merge join. After introducing the feature of row partitioning, it was required to find an algorithm which allows to join a row partitioned table (PPI table) with

  • Another PPI table having different partition characteristics
  • A Non-partitioned table

The optimizer has the possibility to change a PPI table into a NPPI table and vice versa. This step can be followed by a traditional merge join. Nevertheless, in order to avoid this join preparation step, a sliding window merge join can be executed without the requirement of the restructuring of tables.

As opposed to non-partitioned tables (NPPI tables), which are only sorted by rowhash, PPI tables are sorted on each AMP by two levels:

1. Each row is placed into its assigned row partition.
2. Within each row partition the rows are sorted by rowhash (the same way the rows of a NPPI table are stored).

The Sliding Window Merge Join was designed to be able to directly join

  • A NPPI table with a PPI table or
  • Two PPI tables with different partitioning

Directly means without changing a PPI table into a NPPI table or the other way around.

In order to understand the sliding window join process, one has to remember how PPI table rows are stored:

 Data rows are sorted by rowhash within the data blocks, but several row partitions can hold the same rowhash value! 

The easiest way of merge joining a NPPI table against a PPI table seems by join the NPPI table against each PPI table partition. This would be a reasonable solution because the NPPI table, and each partition of the PPI table, are sorted by rowhash allowing to execute a binary search within the data blocks of each partition.

Nevertheless, for performance reasons, Teradata implements a slightly different, but faster, algorithm:

The AMP reads the first data block from the NPPI table and one data block per partition from the PPI table. The rows are joined (binary search) and the algorithm moves down both tables by reading data block after data block from both tables.

The data block of the NPPI tables stays in FSG cache as long as data blocks from any partition of the PPI tables can be matched. If no more rows can be matched, the next data block of the NPPI tables is moved into the FSG cache and above described process is repeated, until the last data blocks of each table and partition is reached.

This process requires each data block of each table to be touched  exactly once (similar to a traditional merge join).

This process could theoretically result in a similar join performance, as we can reach with a traditional merge join, but there is one restriction degrading performance: The available FSG cache memory.

If the FSG cache is not big enough to hold at least one data block from each PPI table partition, the optimizer has to split the join process into so-called windows.

Assume for example, that the PPI tables consists of 4 partitions but there is only space in FSG cache for the first data block of  2 partitions. In this case the process would define 2 windows, each one consisting of 2 partitions:

  1. Join the NPPI table against the first 2 partitions
  2. Join  the NPPI table against the remaining 2 partitions

 

SlidingGFX

As a result, the NPPI table has to be read twice and the join becomes more costly. The more windows are needed, the more costly the join becomes. If the NPPI table is small enough, caching effects could decrease the negative performance impact of having several windows.

The sliding window merge join has a similar performance pattern like a traditional merge join, if most of the partitions can be eliminated before the join takes place. The best case scenario is, when there is sufficient FSG cache available to join all partitions at once. This kind of setup is called single window merge join.

The join of two PPI tables with different partitioning is implemented similar to the join between a NPPI and a PPI table:

The left table and the right table rows are split into windows (each window containing a part of the row partitions) and each window from the left table is joined against each window from the right table.

Here is an example: If the join needs to create 5 windows from the left PPI table and 2 windows from the right PPI table, this would result in a product join of 5*2 windows:

A sliding window merge join is product joining each window of the left table with each window of the right table. The join process between 2 windows is similar to a traditional merge join process with a binary search on both tables.

 

210.What is the default join strategy in Teradata?

  • There is no “default” join strategy.
  • Optimizer decides the type of strategy based on the best retrieval path and other parameters to execute the query.
  • Each join strategy has its own pros and cons, and it’s hard to say which one is the best, depending on different circumstances.
  • The optimizer will choose the best join strategy based on data demographics, statistics and indexes if any of them are available.
  • Using EXPLAIN can help find out what join strategies are to be adopted. No matter which join strategy, it is always applied between two tables. The more tables, the more join steps.
  • Rows must be on the same AMP to be joined.
  • So row distribution or duplication is unavoidable for some join strategies.
  • types of join
    • Product Join
    • Merge Join
    • Exclusion Join
    • Hash Join
    • Nested Join

 

211.What are the types of JOINs available in Teradata?

Types of JOINs are  :

  • Inner Join,
  • Outer Join (Left, Right, Full), S
  • elf Join, Cross Join and
  • Cartesian Joins.

The key things to know about Teradata and Joins

  • Each AMP holds a portion of a table.
  • Teradata uses the Primary Index to distribute the rows among the AMPs.
  • Each AMP keeps their tables separated from other tables like someone might keep clothes in a dresser drawer.
  • Each AMP sorts their tables by Row ID.
  • For a JOIN to take place the two rows being joined must find a way to get to the same AMP.
  • If the rows to be joined are not on the same AMP, Teradata will either redistribute the data or duplicate the data in spool to make that happen.

TERADATA INDEX

213.  PPI(Partition Primary Index).

  • PPI is used to improve performance for large tables when you submit queries that specify a range constraint.
  • PPI allows you to reduce the number of rows to be processed by using partition elimination.
  • PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints

Lets take Order_Table, where we have both January and February dates in column Order_Date.

TeradataWiki-Teradata PPI
The Order_Table spread across the AMPs.Notice that January and February dates are mixed on every AMP in what is a random order. This is because the Primary Index is Order_Number.
When we apply Range Query, that means it uses the keyword BETWEEN.
The BETWEEN keyword in Teradata means find everything in the range BETWEEN this date and this other date. We had no indexes on Order_Date so it is obvious the PE will command the AMPs to do a Full Table Scan. To avoid full table scan, we will Partition the table.
TeradataWiki-Teradata PPI

After Partitioned Table,
The example of AMPs on the top of the page. This table is not partitioned.

The example of AMPs on the bottom of the page. This table is partitioned.

Each AMP always sorts its rows by the Row-ID in order to do a Binary Search on Primary Index queries.

Notice that the rows on an AMP don‘t change AMPs because the table is partitioned. Remember it is the Primary Index alone that will determine which AMP gets a row. If the table is partitioned then the AMP will sort its rows by the partition.
TeradataWiki-Teradata PPI
Now we are running our Range Query on our Partitioned Table,each AMP only reads from one partition. The Parsing Engine will not to do full table scan. It instructs the AMPs to each read from their January Partition. You Partition a Table when you CREATE the Table.
A Partitioned Table is designed to eliminate a Full Table Scan, especially on Range Queries.

Types of partitioning:

RANGE_N Partitioning

Below is the example for RANGE_N Partition by day.
CREATE TABLE ORDER_TABLE

(
ORDER_NO INTEGER NOT NULL,
CUST_NO INTERGER,
ORDER_DATE DATE,
ORDER_TOTAL DECIMAL(10,2)
)
PRIMARY INDEX(ORDER_NO)
PARTITION BY RANGE_N
(ORDER_DATE BETWEEN
DATE ‘2012-01-01’ AND DATE ‘2012-12-31’
EACH INTERVAL ‘7’ DAY);

Case_N Partitioning
CREATE TABLE ORDER_TABLE

(
ORDER_NO INTEGER NOT NULL,
CUST_NO INTERGER,
ORDER_DATE DATE,
ORDER_TOTAL DECIMAL(10,2)
)
PRIMARY INDEX(ORDER_NO)

PARTITION BY CASE_N

(ORDER_TOTAL < 1000,
 ORDER_TOTAL < 2000,
 ORDER_TOTAL < 5000,
 ORDER_TOTAL < 10000,
 ORDER_TOTAL < 20000,
 NO CASE, UNKNOWN);

The UNKNOWN Partition is for an Order_Total with a NULL value. The NO CASE Partition is for partitions that did not meet the CASE criteria.

For example, if an Order_Total is greater than 20,000 it wouldn‘t fall into any of the partitions so it goes to the NO CASE partition.

Multi-Level Partitioning:
You can have up to 15 levels of partitions within partitions.

CREATE TABLE ORDER_TABLE
(
ORDER_NO INTEGER NOT NULL,
CUST_NO INTERGER,
ORDER_DATE DATE,
ORDER_TOTAL DECIMAL(10,2)
)
PRIMARY INDEX(ORDER_NO)
PARTITION BY (RANGE_N
(ORDER_DATE BETWEEN
DATE ‘2012-01-01’ AND DATE ‘2012-12-31’
EACH INTERVAL ‘1’ DAY)
CASE_N (ORDER_TOTAL < 5000,
 ORDER_TOTAL < 10000,
 ORDER_TOTAL < 15000,
 ORDER_TOTAL < 20000,
 NO CASE, UNKNOWN));

 Character Based Partitioning(New Feature V13.10) :
There are four new data types available for Character Based PPI. They are CHAR, VARCHAR, GRAPHIC, and VARGRAPHIC.

Example for RANGE Based Character PPI

 CREATE TABLE EMP_TBL
 (
 EMP_NO INTEGER NOT NULL,
 DEPT_NO INTEGER,
 FIRST_NAME CHAR(20),
 LAST_NAME VARCHAR(20),
SALARY DECIMAL(10,2),
 ) PRIMARY INDEX(EMP_NO)
 PARTITION BY RANGE_N
(LAST NAME BETWEEN ( ‘A ‘,’B ‘,’C ‘,’D ‘,’E ‘,’F ‘,’G ‘,’H ‘,
‘I ‘,’J ‘,’K ‘,’L ‘,’M ‘,’N ‘,’O ‘,’P ‘,’Q ‘,’R ‘,’S ‘,’T ‘,
‘U ‘,’V ‘,’W ‘,’X ‘,’Y ‘,’Z ‘ AND ‘ZZ’,UNKNOWN));

Example for CASE Based Character PPI

CREATE TABLE PRODUCT_TBL
(PRODUCT_ID INTEGER NOT NULL
,PRODUCT_NAME CHAR(30)
,PRODUCT_COST DECIMAL(10,2)
,PRODUCT_DESCRIPTION VARCHAR(100)
)PRIMARY INDEX(PRODUCT_ID)
PARTITION BY CASE_N
(PRODUCT_NAME < ‘Apples’
PRODUCT_NAME < ‘Bananas’
PRODUCT_NAME < ‘Cantaloupe’
PRODUCT_NAME < ‘Grapes’
PRODUCT_NAME < ‘Lettuce’
PRODUCT_NAME < ‘Mangos’
PRODUCT_NAME >=’Mangos’ and <=’Tomatoes’);

Ex-Query: Find all Products between Apples and Grapes?

Ans: SELECT * FROM PRODUCT_TBL WHERE PRODUCT_NAME BETWEEN ‘Apples’ and ‘Grapes’;
Partitioning Rules:
  • A table can have up to 65,535 Partitions.
  • Partitioning never determines which AMP gets row.
  • Partitioning determines how an AMP will sort the row on its own.
  • Table can have up to 15 levels of partitions.
  • A table cannot have an UPI as the Primary Index if the Partition table does not include PI.
  • Total 3 forms of Partitioning Simple, RANGE and CASE.
Advantages of partitioned tables:
  • They provide efficient searches by using partition elimination at the various levels or combination of levels.
  • They reduce the I/O for range constraint queries
  • They take advantage of dynamic partition elimination
  • They provide multiple access paths to the data, and an MLPPI provides even more partition elimination and more partitioning expression choices, (i.e., you can use last name or some other value that is more readily available to query on.)
  • The Primary Index may be either a UPI or a NUPI; a NUPI allows local joins to other similar entities
  • Row hash locks are used for SELECT with equality conditions on the PI columns.
  • Partitioned tables allow for fast deletes of data in a partition.
  • They allow for range queries without having to use a secondary index.
  • Specific partitions maybe archived or deleted.
  • May be created on Volatile tables; global temp tables, base tables, and non-compressed join indexes.
  • May replace a Value Ordered NUSI for access.
  • Tables that hold yearly information don’t have to be split into 12 smaller tables to avoid Full Table Scans (FTS). This can make modeling and querying easier.
  • Fastload and Multiload work with PPI tables, but not with all Secondary Indexes.
  • Range queries don’t have to utilize a Full Table Scan.
  • Deletions of entire partitions are lightning fast.
  •  PPI provides an excellent solution instead of using Secondary Indexes

Disadvantage of partitioned tables:

  • A two-byte Partition number is added to the ROW-ID and it is now called a ROW KEY. The two-bytes per row will add more Perm Space to a table.
  • Joins to Non-Partitioned Tables can take longer and become more complicated for Teradata to perform.
  • Basic select queries utilizing the Primary Index can take longer if the Partition number is not also mentioned in the WHERE clause of the query.
  • You can’t have a Unique Primary Index (UPI) if the Partition Number is not at least part of the Primary Index. You must therefore create a Unique Secondary Index to maintain uniqueness.
  • Access via the Primary Index may take longer
    Full table joins to a NPPI table with the same PI may take longer

 214.Highlight the differences between Primary Key and Primary Index.

  • Primary index is quite mandatory, whereas Primary Key is optional.
  • Primary Index has a limit of 64 tables/columns, whereas Primary Key does not have any limit.
  • Primary Index allows duplicates and nulls, whereas Primary Key doesn’t.
  • Primary Index is a physical mechanism, whereas Primary Key is purely logical mechanism.
  • Primary Key is a concept that uniquely identifies a particular row of a table.
  • Primary INDEX is used for finding best access path for data retrieval and data insertion and Primary KEY is used for finding each rows uniquely just like in other RDBMS.

215.What are the types of PI (Primary Index) in Teradata?

There are two types of Primary Index.

Unique Primary Index ( UPI) and

Non Unique Primary Index (NUPI).

  • By default, NUPI is created when the table is created.
  • Unique keyword has to be explicitly given when UPI has to be created.
  • UPI will slower the performance sometimes as for each and every row , uniqueness of the column value has to be checked and it is an additional overhead to the system but the distribution of data will be even.
  • Care should be taken while choosing a NUPI so that the distribution of data is almost even .
  • UPI/NUPI decision should be taken based on the data and its usage.
  • One Primary Index per table.
  • A Primary Index value can be unique or non-unique.
  • The Primary Index value can be NULL.
  • The Primary Index value can be modified.
  • The Primary Index of a populated table cannot be modified.
  • A Primary Index has a limit of 64 columns.
  • By default it takes as first column as PI

216. In the Primary Index, what is the score of AMPs that are actively involved?

Only one AMP is actively involved in a Primary Index.

 217.Can we have several Primary Index on a table?

No we cannot have multiple Primary Index on a table. However we can create 32 Secondary Indexes on a table. 

218.Can we define PARTITION PRIMARY INDEX (PPI) on a Primary Index column in Table? Explain Why?

 PPI cannot be defined on PI column in Table. Since PI is used to locate an AMP and store the data based on HASH value (ROW ID ) it cannot be used as PPI column. In case of PPI , The data stored in AMP’s are Partitioned based on PPI column after they are row hashed (ROW KEY = ROW ID +PPI value ) P.S: If you want to create UPI on table, then PPI column can be added as part of PI .

219. How many types of Index are present in teradata?

 There are 5 different indices present in Teradata

  • 1. Primary Index a.Unique primary index b. non Unique primary index
  • 2. Secondary Index a. Unique Secondary index b. non Unique Secondary index
  • 3. Partitioned Primary Index a. Case partition (ex. age, salary…) b. range partition ( ex. date)
  • 4. Join index
    •  Single table join index
    • multiple table join index
    •  Sparse Join index ( constraint applied on join index in where clause)
    • aggregate index
  • 5. Hash index

220.What are Secondary Indexes (SI) , types of SI and disadvantages of Secondary Indexes in Teradata?

  • Secondary Indexes provide another path to access data.
  • Teradata allows up to 32 secondary indexes per table.
  • Keep in mind; row distribution of records does not occur when secondary indexes are defined.
  • The value of secondary indexes is that they reside in a subtable and are stored on all AMPs, which is very different from how the primary indexes (part of base table) are stored.
  • Keep in mind that Secondary Indexes (when defined) do take up additional space. Secondary Indexes are frequently used in a WHERE clause.
  • The Secondary Index can be changed or dropped at any time.
  • However, because of the overhead for index maintenance, it is recommended that index values should not be frequently changed.
  • There are two different types of Secondary Indexes, Unique Secondary Index (USI), and Non-Unique Secondary Index (NUSI).
  • Unique Secondary Indexes are extremely efficient. A USI is considered a two-AMP operation. One AMP is utilized to access the USI subtable row (in the Secondary Index subtable) that references the actual data row, which resides on the second AMP.
  • A Non-Unique Secondary Index is an All-AMP operation and will usually require a spool file. Although a NUSI is an All-AMP operation, it is faster than a full table scan.

Secondary indexes can be useful for:

  • Satisfying complex conditions
  • Processing aggregates
  • Value comparisons
  • Matching character combinations
  • Joining tables

: Advantages:

  1. A secondary index might be created and dropped dynamically
  2.      table may have up to 32 secondary indexes.
  3. Secondary index can be created on any column. .Either Unique or Non-Unique
  4. It is used as alternate path or Least frequently used cases.  ex. defining SI on non indexed column can improve the performance, if it is used in  join or filter condition of a given query.
  5. Collecting Statistics on SI columns make sure Optimizer chooses SI if it is better than doing Full Table Scans

Disadvantages

  1. Since Sub tables are to be created, there is always an overhead for additional spaces.
  2. They require additional I/Os to maintain their sub tables.
  3. The Optimizer may, or may not, use a NUSI, depending on its selectivity.
  4. If the base table is Fallback, the secondary index sub table is Fallback as well.
  5. If statistics are not collected accordingly, then the optimizer would go for Full Table Scan.
  6. Secondary Indexes need more I/O.
  7. Secondary Index slows down the process because of Sub-table maintenance.
  8. Secondary Index requires more Physical Storage.

221.How are the data distributed in Secondary Index Subtables in Teradata?

  • When a user creates a Secondary Index, Teradata automatically creates a Secondary Index Subtable.
    • The subtable will contain the:
    • Secondary Index Value
    • Secondary Index Row ID
    • Primary Index Row ID
  • When a user writes an SQL query that has an SI in the WHERE clause, the Parsing Engine will Hash the Secondary Index Value. The output is the Row Hash, which points to a bucket in the Hash Map.
  • That bucket contains an AMP number and the Parsing Engine then knows which AMP contains the Secondary Index Subtable pertaining to the requested USI information. The PE will direct the chosen AMP to look-up the Row Hash in the Subtable.
  • The AMP will check to see if the Row Hash exists in the Subtable and double check the subtable row with the actual secondary index value.
  • Then, the AMP will pass the Primary Index Row ID back up the BYNET network. This request is directed to the AMP with the base table row, which is then easily retrieved.

222.  What are the things to considered while creating secondary index?

  •   Creating a secondary index causes Teradata to build a sub-table to contain its index rows, thus adding another set of rows that requires updating each time a table row is inserted, deleted, or updated.
  • Secondary index sub-tables are also duplicated whenever a table is defined with FALLBACK, so the maintenance overhead is effectively doubled.

223.  What is primary index and secondary index?Answer:     

  • Primary index is the mechanism for assigning a data row to an AMP and a location on the AMP’s disks.
  • Indexes also used to access rows from a table without having to search the entire table.
  • Secondary indexes enhance set selection by specifying access paths less frequently used than the primary index path.
  • Secondary indexes are also used to facilitate aggregate operations. If a secondary index covers a query, then the Optimizer determines that it would be less costly to accesses its rows directly rather than using it to access the base table rows it points to.

224. What is the purpose of indexes? Answer:  

  • An index is a mechanism that can be used by the SQL query optimizer to make table access more performant.
  • Indexes enhance data access by providing a more-or-less direct path to stored data and avoiding the necessity to perform full table scans to locate the small number of rows you typically want to retrieve or update. 

225.How does indexing improve query performance?

  • By creating index on single or multiple columns, query takes index access path instead of full table scan.
  • Full table scan is costilier operation where it has to visit all rows in the table whereas index operation just accesses the index sub table with required hash value and joins with the basetable using rowid. This improves response time and performance of the query.
  • Teradata automatically Creates the Primary index , if not specified by the DDl for Table .
    • PI
    • Secondary Index
    • PPI
    • Join Index
    • HASH Index
  • Indexing is a way to physically reorganise the records to enable some frequently used queries to run faster. The index can be used as a pointer to the large table.
  • It helps to locate the required row quickly and then return it back to the user.
  • or The frequently used queries need not hit a large table for data. they can get what they want from the index itself. – cover queries.
  • Index comes with the overhead of maintanance.
  • Teradata maintains its index by itself. Each time an insert/update/delete is done on the table the indexes will also need to be updated and maintained.
  • Indexes cannot be accessed directly by users. Only the optimizer has access to the index.
226.What are the differences between the followings? – Vertical & Horizontal Partitioning vs – Join & Hash Indexes vs – PPI
  • Horizontal partitioning is used in large tables where no of rows are very high.
  • Rows are broken and placed in partitions.
  • Loading becomes faster and query performance  would be good.
  • Vertical partitioning is used for tables with huge no of columns.
  • Column wise data  are broken and placed in partitions with a common key.
  • A multi/aggregate join index basically stores the results of the join in perm space.
  • A hash index basically stores the results of the table based on the hash value.
  • PPI basically partitions the data based on the column that is specified in the partition clause.

227.What is Value-Ordered NUSIs ?

  • When we create a NUSI on the table a subtable is built on all AMP’s.
  • Each Subtable contains a row for the NUSI and corresponding row ids to the base rows located on the same AMP.
  • Rows in the subtable are sequenced by the row hash of the NUSI value.
  • This way of storing rows is convenient in storing rows with particular NUSI, but not good for range queries.
  • This is where Value Ordered NUSI becomes useful VALUE ordered NUSI allows rows in the subtable to be stored in order of data value , rather than hash of the value. This is useful when performing range queries.

228.Creating value ordered NUSI when creating table.

Example 1:
CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno) ORDER BY VALUES;

Doing a show table will show below:

SHOW TABLE EMPLOYEE1

CREATE SET TABLE EDW_RESTORE_TABLES.EMPLOYEE1 ,NO FALLBACK ,
   NO BEFORE JOURNAL,
   NO AFTER JOURNAL,
   CHECKSUM = DEFAULT,
   DEFAULT MERGEBLOCKRATIO
   (
   Employeeid INTEGER,
   DepartmentNo INTEGER,
   Salary DECIMAL(8,2),
   Hiredate DATE FORMAT 'YYYY/MM/DD')
PRIMARY INDEX ( Employeeid )
INDEX ( DepartmentNo ) ORDER BY VALUES ( DepartmentNo ); 
--> note that DEPARTMENTNO is used to order. 
Here since we had only one column in the index, it is used for ordering.

However when we have index made of multiple columns we can choose which column to use fo ordering.

Example 2:

CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno,Employeeid) ORDER BY VALUES;
/* Multiple columns as part of INDEX*/



SHOW TABLE EMPLOYEE1

CREATE SET TABLE EDW_RESTORE_TABLES.EMPLOYEE1 ,NO FALLBACK ,
   NO BEFORE JOURNAL,
   NO AFTER JOURNAL,
   CHECKSUM = DEFAULT,
   DEFAULT MERGEBLOCKRATIO
   (
   Employeeid INTEGER,
   DepartmentNo INTEGER,
   Salary DECIMAL(8,2),
   Hiredate DATE FORMAT 'YYYY/MM/DD')
PRIMARY INDEX ( Employeeid )
INDEX ( Employeeid ,DepartmentNo ) ORDER BY VALUES ( DepartmentNo ); 
--> by default system choose the 1st column DEpartmentNo for ordering.


In case of multiple columns being part of index 
we can specify the column which we would want to be used 
for ordering as shown in below example:

Example 3:

CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno,Employeeid) ORDER BY VALUES(employeeid);


Creating Value ordered NUSI using CREATE INDEX syntax

CREATE INDEX(DepartmentNO) ORDER BY VALUES (DepartmentNO) on EMPLOYEE ;


Rules for using value ordered NUSI

The ordering can be done only on 1 single column.

In below we try to create a value ordered NUSI with two columns

CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno,Employeeid) ORDER BY VALUES(Departmentno,Employeeid) ;

Following is the error message we get:

CREATE TABLE Failed. 5466: Error in Secondary Index DDL, Order by clause can have only one column. 

Column used for ordering must a part or all of the index definitions

In the following query we are ordering by a column that is not a part of index definition

CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno,Employeeid) ORDER BY VALUES(salary) ;

Following is the error message we get:

CREATE TABLE Failed. 5466: Error in Secondary Index DDL,
 Order by field does not belong to the index. 


The column should be numeric- non numeric columns cannot be used.

Column should not be greater than 4 bytes. Thus only INT,
 SMALLINT, BYTEINT, DATE, DEC are valid. A Decimal is 
also allowed provided its storage length does not exceed 
4 bytes and it does not have any precision digits.
)
229.I have not defined PRIMARY INDEX or PRIMARY KEY on table what will happen now: In this case,

a) I have not defined PRIMARY INDEX or PRIMARY KEY on table what will happen now: In this case, Teradata will check if any column is defined as UNIQUE, then it will make that column as UNIQUE PRIMARY INDEX else first column will be created as PRIMARY INDEX.

b) I have not defined PRIMARY INDEX however a column is defined as PRIMARY KEY: In this case, Teradata will make the PRIMARY KEY column as UNIQUE PRIMARY INDEX of the table.

c) I have defined both PRIMARY KEY and PRIMARY INDEX on different column: In this case, Teradata will make PRIMARY KEY column as UNIQUE SECONDARY INDEX i.e. UNIQUE INDEX on the table. So one must understand the importance of PRIMARY INDEX in Teradata. Generally, PRIMARY KEY concept is taken care by UNIQUE PRIMARY INDEX in Teradata environment.

230.How Teradata can accesses data ?
  • Primary Index (fastest)
  • Secondary Index (second fastest way)
  • Full Table Scan (slowest way)

Primary Index (fastest) – When ever a Primary Index is utilized in the SQL WHERE Clause the PE will be able to use the Primary Index to get the data with a one- AMP operation.

Secondary Index (next fastest) -If the Primary Index is not utilized sometimes Teradata can utilize a secondary index.It is not as fast as the Primary Index, but it is much faster than a full table scan.

Full Table Scan (FTS) (Slowest) Teradata handles full table scans brilliantly because Teradata accesses eachdata row only once because of the parallel processing. Full Table Scans are a way to access Teradata without using an index.Each data block per table is read only once.

231.What are the reasons for product joins ?

  • 1.Stale or no stats causing optimizer to use product join
  • 2. Improper usage of aliases in the query.
  • 3. missing where clause ( or Cartesian product join  1=1 )
  • 4. non equality conditions like > ,< , between   example ( date)
  • 5. few join conditions
  • 6.  when or conditions are used. last but not the least   product joins are not bad always!! sometimes PJ are better compared to other types of joins.

233.Create Table: When you are designing/creating the table, check the below steps.

  • Always create Multiset table.If you have a requirement to create a SET table then make sure that you have a unique primary index or unique secondary index on that table.
  • If there is no UPI or USI on the SET table then the new row will check every
  • row with same row hash value byte-by-byte to make sure that a duplicate doesn’t insert into the SET table.
  • Choose the primary index that has even distribution and high join access. If the column has unique values and is not a primary index then create Unique Secondary Index. But Fastload and Multiload will not work if the table has Unique Secondary  index.
  • Do not create any UNICODE columns unless absolutely required. UNICODE requires double the storage space and stats often become unusable for Teradata.
  • If the table has no updates then you can change default Free Space Percent Parameter of a table to 0 so that the data block can hold many rows and then all rows can be retrieved in a single I/O.
  • Try to create vertical partition if the table has too many columns and some of the columns are used very rare.If you reduce the table size then the data block can hold many rows so I/O can be more efficient.
  • If the 90% of users queries data for current year then horizontal partition gives best performance.
  • You can split the table by creating two tables history and current.
  • Another approach is by creating partition. Teradata partition reduces the overhead of scanning the complete table thus improving performance. Choose the partition table when is accessed by date ranges or character columns. Make sure that number of partitions is less.
  • Compression reduces the storage capacity and the number of I/O’s required for an operation. Wherever possible, do the compression.
  • If there is a join column in a table and  has more null values then replace   the null values with some random number using random function and then multiply with -1 then those records can be easily distinguished from other records. Better to give random range like 1 to 10000 so it will not impact distribution if there is an upgrade in near future. Business users don’t want to see negative or unknown members in the reports so replace negative values with null in an application layer.
  • Try to define all join columns with not null at table level even though source is providing null values. ETL logic should populate it with some random values if they are null so there will be no  data skew issues or spool out error in applications.
  • Wherever possible try to avoid outer joins. We can avoid outer joins by inserting some unknown (dummy) records into the dimension (master) table. Same range values will be inserted into the FACT (Transaction) table wherever there is a null in corresponding column.
  • Note down that there is a difference between create table (copy) and create table (select). Create table (copy) retains all definitions and indexes where as create table (select) not.

TERADATA JOIN INDEX

234 .What is a join index ? What are benefits of using join index ?

  • Join Indexes are created so data doesn’t have to move to satisfy the join.
  • The Join Index essentially pre-joins the table and keeps it hidden for the Parsing Engine to utilize.
  • Remember, the user cannot select directly from the Join Index.
  • The PE decides when it is to be used whenever the user queries the base table or the views of the base table.
  • Join Indexes are physically stored exactly like normal Teradata tables.•
  • Users can’t query the Join Index directly, but PE will decide when to use.
  • Join Indexes are automatically updated as base tables change.
  • Join Indexes can have Non-Unique Primary Indexes (NUPI) only.
  • Join Indexes can have Non-Unique Secondary (NUSI) Indexes.
  • Max 64 Columns per Table per Join Index
  • BLOB and CLOB types cannot be defined.
  • Triggers with Join Indexes allowed V2R6.2
  • After Restoring a Table, Drop and Recreate the Join Index.FastLoad/ MultiLoad won’t load to tables with a Join Index defined
  • It is a index that is maintained in a system .
  • It maintains rows joined on two or more tables.
  • Join index is useful for queries where the index structure contains all the columns referenced by one or more joins, thereby allowing the index to cover all or part of the query

235.Benefits if using join index is

  • To eliminate base table access.
  • Aggregate processing is eliminated by creating aggregate join index.
  • It reduces redistribution of data since data is materialized by JI.
  • Reduces complex join conditions by using covering queries. 

236.What are Teradata Join Indexes best suited for?

  • Frequent joins of n middle to large tables with a significant number of rows from both tables being joined.
  • Frequent joins of large tables and a relatively small set of columns is repeatedly requested.
  • For frequently run queries with complex expression in its predicate.
  • For join indexes assistance in de-normalization.
  • Perfectly suited for alternative partitioning.
  • Very useful when alternate keys are used a lot for joining.
  • Move large and time-consuming joins or aggregations into the batch windows.
  • Join indexes even allow direct access via the primary index of the JI.

237.The Teradata Join Index and Partitioning

  • The Join Index can be partitioned
  • A Partitioned JI for a non-partitioned base table is ok, but non-partitioned JI on partitioned base table not recommended from a performance point of view
  • You cannot specify a partition for a row compressed Join Index

 

238.Full or partial Coverage of the Join IndexIf the join index is not fully covering, it can only be used to locate the ROWID of the base table data record. A covering join index is a special case, where the join index contains all required columns and can directly satisfy the query without accessing the base table.

First of all, it is important to know:

There is no full coverage for SELECT * FROM table_name queries.

Avoid the practice of writing your SQL queries, without specifying a column list, if you want to achieve full join index coverage.
The usage of the join index with partial coverage is possible in the following situations:

  • One of the columns in the join index definition is the keyword ROWID. In this case, the base table ROWID is stored with each index row and a lookup of the base table row can be done.
  • The column set of the UPI of the underlying table is part of the definition. In this case the column combination can be hashed and the ROWID of the base table row can be derived
  • The column set of the NUPI of the underlying table is part of the definition plus either
    (a) one of the columns in the definition of that index is the keyword ROWID or
    (b) the column set defining a USI on the underlying base table.

For (a) like in the case of the USI coverage, the base table rows can be accessed via the ROWID

For (b), similar to the UPI case described above, just that the base table rows can be accessed indirectly by hashing the USI columns,                       extracting the base table ROWIDs from the USI into the spool and retrieving the base table rows by scanning this spool.

Coverage does not guarantee use of a Join Index. The cost of using must be estimated to be less than the cost of not using!

239.Restrictions on SQL Operations

  • FULL OUTER JOIN not allowed
  • LEFT or RIGHT JOINS: on the inner side at least 1 non-nullable column
  • OUTER JOIN preferable for Join Index usage likelihood, but not allowed for Aggregate JI
  • HAVING and QUALIFY not allowed
  • No Set Operations: UNION, INTERSECT, MINUS
  • No Sub queries
  • No inequality conditions for ON clauses in join index definitions. Supported only if ANDed to at least one equality join condition
  • Only <, <=, >=, > as comparison operators in join allowed
  • TOP n and SAMPLE not allowed

SELECT * FROM dbc.indices WHERE indextype=’J’;

240.TYPES OF JOINS INDEX  

 
Multi-Table Join Index
A Multi-Table Outer Join Index has some very specific rules above to remember.
The Outer Join Index has the additional rows that did NOT match.
 
Compressed Multi-Table Join Index
  • A Compressed Multi-Table Join Index won’t keep repeating the same Customer_Number and Customer_Name, but only list it once.
  •  compressed Join Index (such as the example above) is an example of Row compression because it eliminates space by not duplicating the Customer_Number or Customer_Name. Some rows are essentially smaller and this is a form of compression!
 
 Single-Table Join Index
  • If a USER queries with the Dept_No in the WHERE clause this will be a Single-AMP retrieve.
  • If the USER joins the Employee and Department Tables together then Teradata won’t need to Redistribute or Duplicate to get the data AMP local. 
Single Table Join Index Great For LIKE Clause
  • Build a STJI with column that contains three columns. They are the LIKE column being queried, the primary index column of the base table and the keyword ROWID!
  • The PE will choose to scan the narrow table (the Join Index) and qualify all rows that qualify with a car license LIKE ‘TeraT%’,
  • then the PE uses the ROWID to get data from the BMV_Table where row is on the same AMP because both the base table and join index are on the same AMP because they both have the same Primary Index.
  • This can save enormous time for queries using the LIKE command.
  • A LIKE command on a base table will never use a Non-Unique Secondary Index (NUSI).
  • The above technique should be tested and only used if a lot of users are utilizing the LIKE command on a large table. If that is the case a lot of time can be saved.

Single Table Join Index with Value Ordered NUSI

  • A Value Ordered NUSI can only be done on columns that are 4-byte integers.
  • Dates qualify because they are stored internally in Teradata as 4-byte integers.
  • A value ordered index has been expanded from 16 to 64 columns.
  • Indexes are always sorted by their hash, but a Value Ordered index is sorted on each AMP by values and not hash.

Aggregate Join Indexes

  • Aggregate Join Indexes may be defined on:
    • Single Tables – A columnar subset of a base table with aggregates automatically maintained by Teradata.
    • Multiple Tables -A columnar subset of as many as 64 base tables with aggregate columns automatically maintained by Teradata.
  • Sparse Join Indexes are defined with a WHERE clause that limits the number of base table rows included and the space required to store them.
  • Aggregate Join Indexes can only include SUM and COUNT values.
  • You can calculate Averages from these two columns though.
Sparse Join Index
A Sparse Join Index has a WHERE clause so it doesn’t take all the rows in the table, but only a portion. This is a very effective way to save space and focus on the latest data.
Global Multi-Table Join Index
With the ROWID inside the Join Index, the PE can get columns in the User’s SQL NOT specified in the Join Index directly from the Base Table by using the Row-ID.
 Hash Index
A Hash Index can be Ordered by Values or by Hash.
  • The storage of a vertical subset (columns) of the table in the hash index structure
  • The selection of a better Primary Index in order to avoid costly redistribution activities for join preparation.
  • The Ordering of rows by value, to support range scans
  • Hash indexes are similar to single-table simple join indexes in that they are used for denormalizing a single table by storing rows in a separate physical object from the base table.
  • Hash indexes are limited to a SINGLE table and are beneficial in improving efficiency. Like join indexes, Hash index rows are stored in a different object than the base table rows.
  • This allows for faster access without needing access to the underlying rows. Also like their relative, they function as an alternative to the base rows instead of providing a different access path.
  • Like join indexes, hash indexes are known as “covered queries” because they are utilized when they contain all the columns requested by the query. If all of the columns are covered, the optimizer will usually choose to access the rows via the hash index instead of the base table rows.
  • In a situation where the hash index partially covers the query, the optimizer may use the Row ID included with the Hash Index to access the other columns in the data row. Join indexes and hash indexes are both transparently maintained by Teradata. When the base table changes, the Hash Index table is automatically updated.
  • This automatic task is referred to as an update burden.
  • Being that Hash Indexes are strikingly similar in functionality to secondary indexes, they should be carefully considered because they carry this update burden.
  • Hash indexes however, can offer marginal performance gains over a secondary index when certain columns are frequently needed from one table to join with many others. In these situations, the Hash index partially covers the query, but fully covers the rows from the base table, effectively eliminating any need to access the base table.
  • Hash Indexes: • Can be ordered by hash or by values to facilitate range queries. • Automatically has the Row ID of the base table row included in the Hash Index which the RDBMS software can use to access columns not included in a “covered query”

Maybe the most outstanding limitations compared with a Single Table Join Index are:

  • – A Hash Index cannot have a Partitioned Primary Index
    – A Hash Index cannot have a Non-Unique Secondary Index.
    – Hash Indexes cannot be specified for NOPI or column‑partitioned base tables as they are designed around the Teradata hashing algorithm.
    – A hash index cannot be column partitioned
    –  A hash index must have a Primary Index, a Single Table Join Index can be created with or without a primary index if a table is column-partitioned (as column stores on Teradata never have a Primary Index)

Hash Indexes are base tables that cannot be accessesd directly by a query.The differences between hash and single table join indexes are:1)Hash indexes cannot have secondary index,but single table join indexes can have secondary index.2)Hash index column list cannot have aggreagete functions,but single table join index supports aggreagete function with column lis

The optimizer includes a hash index in a query plan when the index covers all or part of a join query thus eliminating the need to redistribute the rows.However hash index carries more burden than a secondary index.And the storage of a hash index is similar to a base table storage except that hash indexes can be compressed.

http://www.dwhpro.com/the-teradata-hash-index/