TERADATA INDEX

212.How to Choose Primary Index(PI) in Teradata?

We can choose PI based on below criteria

Frequently using column.
Columns which have more unique value
Columns which are using in join clause
Not regularly changing column values
Access Demographics Identify index candidates that maximize one-AMP operations. Columns most frequently used for access (Value and Join).
Distribution Demographics Identify index candidates that optimize parallel processing. Columns that provide good distribution
Volatility Demographics Identify index candidates with low maintenance I/O.

213. PPI(Partition Primary Index).

PPI is used to improve performance for large tables when you submit queries that specify a range constraint.
PPI allows you to reduce the number of rows to be processed by using partition elimination.
PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints

Lets take Order_Table, where we have both January and February dates in column Order_Date.

The Order_Table spread across the AMPs.Notice that January and February dates are mixed on every AMP in what is a random order. This is because the Primary Index is Order_Number.

When we apply Range Query, that means it uses the keyword BETWEEN.

The BETWEEN keyword in Teradata means find everything in the range BETWEEN this date and this other date. We had no indexes on Order_Date so it is obvious the PE will command the AMPs to do a Full Table Scan. To avoid full table scan, we will Partition the table.

After Partitioned Table,
The example of AMPs on the top of the page. This table is not partitioned.

The example of AMPs on the bottom of the page. This table is partitioned.

Each AMP always sorts its rows by the Row-ID in order to do a Binary Search on Primary Index queries.

Notice that the rows on an AMP don‘t change AMPs because the table is partitioned. Remember it is the Primary Index alone that will determine which AMP gets a row. If the table is partitioned then the AMP will sort its rows by the partition.

Now we are running our Range Query on our Partitioned Table,each AMP only reads from one partition. The Parsing Engine will not to do full table scan. It instructs the AMPs to each read from their January Partition. You Partition a Table when you CREATE the Table.

A Partitioned Table is designed to eliminate a Full Table Scan, especially on Range Queries.

Types of partitioning:

RANGE_N Partitioning

Below is the example for RANGE_N Partition by day.
CREATE TABLE ORDER_TABLE

(

ORDER_NO INTEGER NOT NULL,

CUST_NO INTERGER,

ORDER_DATE DATE,

ORDER_TOTAL DECIMAL(10,2)

)

PRIMARY INDEX(ORDER_NO)

PARTITION BY RANGE_N

(ORDER_DATE BETWEEN

DATE ‘2012-01-01’ AND DATE ‘2012-12-31’

EACH INTERVAL ‘7’ DAY);

Case_N Partitioning
CREATE TABLE ORDER_TABLE

(

ORDER_NO INTEGER NOT NULL,

CUST_NO INTERGER,

ORDER_DATE DATE,

ORDER_TOTAL DECIMAL(10,2)

)

PRIMARY INDEX(ORDER_NO)

PARTITION BY CASE_N

(ORDER_TOTAL < 1000,

ORDER_TOTAL < 2000,

ORDER_TOTAL < 5000,

ORDER_TOTAL < 10000,

ORDER_TOTAL < 20000,

NO CASE, UNKNOWN);

The UNKNOWN Partition is for an Order_Total with a NULL value. The NO CASE Partition is for partitions that did not meet the CASE criteria.

For example, if an Order_Total is greater than 20,000 it wouldn‘t fall into any of the partitions so it goes to the NO CASE partition.

Multi-Level Partitioning:
You can have up to 15 levels of partitions within partitions.

CREATE TABLE ORDER_TABLE

(

ORDER_NO INTEGER NOT NULL,

CUST_NO INTERGER,

ORDER_DATE DATE,

ORDER_TOTAL DECIMAL(10,2)

)

PRIMARY INDEX(ORDER_NO)

PARTITION BY (RANGE_N

(ORDER_DATE BETWEEN

DATE ‘2012-01-01’ AND DATE ‘2012-12-31’

EACH INTERVAL ‘1’ DAY)

CASE_N (ORDER_TOTAL < 5000,

ORDER_TOTAL < 10000,

ORDER_TOTAL < 15000,

ORDER_TOTAL < 20000,

NO CASE, UNKNOWN));

Character Based Partitioning(New Feature V13.10) :
There are four new data types available for Character Based PPI. They are CHAR, VARCHAR, GRAPHIC, and VARGRAPHIC.

Example for RANGE Based Character PPI

CREATE TABLE EMP_TBL

(

EMP_NO INTEGER NOT NULL,

DEPT_NO INTEGER,

FIRST_NAME CHAR(20),

LAST_NAME VARCHAR(20),

SALARY DECIMAL(10,2),

) PRIMARY INDEX(EMP_NO)

PARTITION BY RANGE_N

(LAST NAME BETWEEN ( ‘A ‘,’B ‘,’C ‘,’D ‘,’E ‘,’F ‘,’G ‘,’H ‘,

‘I ‘,’J ‘,’K ‘,’L ‘,’M ‘,’N ‘,’O ‘,’P ‘,’Q ‘,’R ‘,’S ‘,’T ‘,

‘U ‘,’V ‘,’W ‘,’X ‘,’Y ‘,’Z ‘ AND ‘ZZ’,UNKNOWN));

Example for CASE Based Character PPI

CREATE TABLE PRODUCT_TBL

(PRODUCT_ID INTEGER NOT NULL

,PRODUCT_NAME CHAR(30)

,PRODUCT_COST DECIMAL(10,2)

,PRODUCT_DESCRIPTION VARCHAR(100)

)PRIMARY INDEX(PRODUCT_ID)

PARTITION BY CASE_N

(PRODUCT_NAME < ‘Apples’

PRODUCT_NAME < ‘Bananas’

PRODUCT_NAME < ‘Cantaloupe’

PRODUCT_NAME < ‘Grapes’

PRODUCT_NAME < ‘Lettuce’

PRODUCT_NAME < ‘Mangos’

PRODUCT_NAME >=’Mangos’ and <=’Tomatoes’);

Ex-Query: Find all Products between Apples and Grapes?

Ans: SELECT * FROM PRODUCT_TBL WHERE PRODUCT_NAME BETWEEN ‘Apples’ and ‘Grapes’;

Partitioning Rules:

A table can have up to 65,535 Partitions.
Partitioning never determines which AMP gets row.
Partitioning determines how an AMP will sort the row on its own.
Table can have up to 15 levels of partitions.
A table cannot have an UPI as the Primary Index if the Partition table does not include PI.
Total 3 forms of Partitioning Simple, RANGE and CASE.

Advantages of partitioned tables:

They provide efficient searches by using partition elimination at the various levels or combination of levels.
They reduce the I/O for range constraint queries
They take advantage of dynamic partition elimination
They provide multiple access paths to the data, and an MLPPI provides even more partition elimination and more partitioning expression choices, (i.e., you can use last name or some other value that is more readily available to query on.)
The Primary Index may be either a UPI or a NUPI; a NUPI allows local joins to other similar entities
Row hash locks are used for SELECT with equality conditions on the PI columns.
Partitioned tables allow for fast deletes of data in a partition.
They allow for range queries without having to use a secondary index.
Specific partitions maybe archived or deleted.
May be created on Volatile tables; global temp tables, base tables, and non-compressed join indexes.
May replace a Value Ordered NUSI for access.
Tables that hold yearly information don’t have to be split into 12 smaller tables to avoid Full Table Scans (FTS). This can make modeling and querying easier.
Fastload and Multiload work with PPI tables, but not with all Secondary Indexes.
Range queries don’t have to utilize a Full Table Scan.
Deletions of entire partitions are lightning fast.
PPI provides an excellent solution instead of using Secondary Indexes

Disadvantage of partitioned tables:

A two-byte Partition number is added to the ROW-ID and it is now called a ROW KEY. The two-bytes per row will add more Perm Space to a table.
Joins to Non-Partitioned Tables can take longer and become more complicated for Teradata to perform.
Basic select queries utilizing the Primary Index can take longer if the Partition number is not also mentioned in the WHERE clause of the query.
You can’t have a Unique Primary Index (UPI) if the Partition Number is not at least part of the Primary Index. You must therefore create a Unique Secondary Index to maintain uniqueness.
Access via the Primary Index may take longer
Full table joins to a NPPI table with the same PI may take longer

214.Highlight the differences between Primary Key and Primary Index.

Primary index is quite mandatory, whereas Primary Key is optional.
Primary Index has a limit of 64 tables/columns, whereas Primary Key does not have any limit.
Primary Index allows duplicates and nulls, whereas Primary Key doesn’t.
Primary Index is a physical mechanism, whereas Primary Key is purely logical mechanism.
Primary Key is a concept that uniquely identifies a particular row of a table.
Primary INDEX is used for finding best access path for data retrieval and data insertion and Primary KEY is used for finding each rows uniquely just like in other RDBMS.

215.What are the types of PI (Primary Index) in Teradata?

There are two types of Primary Index.

Unique Primary Index ( UPI) and

Non Unique Primary Index (NUPI).

By default, NUPI is created when the table is created.
Unique keyword has to be explicitly given when UPI has to be created.
UPI will slower the performance sometimes as for each and every row , uniqueness of the column value has to be checked and it is an additional overhead to the system but the distribution of data will be even.
Care should be taken while choosing a NUPI so that the distribution of data is almost even .
UPI/NUPI decision should be taken based on the data and its usage.
One Primary Index per table.
A Primary Index value can be unique or non-unique.
The Primary Index value can be NULL.
The Primary Index value can be modified.
The Primary Index of a populated table cannot be modified.
A Primary Index has a limit of 64 columns.
By default it takes as first column as PI

216. In the Primary Index, what is the score of AMPs that are actively involved?

Only one AMP is actively involved in a Primary Index.

217.Can we have several Primary Index on a table?

No we cannot have multiple Primary Index on a table. However we can create 32 Secondary Indexes on a table.

218.Can we define PARTITION PRIMARY INDEX (PPI) on a Primary Index column in Table? Explain Why?

PPI cannot be defined on PI column in Table. Since PI is used to locate an AMP and store the data based on HASH value (ROW ID ) it cannot be used as PPI column. In case of PPI , The data stored in AMP’s are Partitioned based on PPI column after they are row hashed (ROW KEY = ROW ID +PPI value ) P.S: If you want to create UPI on table, then PPI column can be added as part of PI .

219. How many types of Index are present in teradata?

There are 5 different indices present in Teradata

1. Primary Index a.Unique primary index b. non Unique primary index
2. Secondary Index a. Unique Secondary index b. non Unique Secondary index
3. Partitioned Primary Index a. Case partition (ex. age, salary…) b. range partition ( ex. date)
4. Join index
- Single table join index
- multiple table join index
- Sparse Join index ( constraint applied on join index in where clause)
- aggregate index
5. Hash index

220.What are Secondary Indexes (SI) , types of SI and disadvantages of Secondary Indexes in Teradata?

Secondary Indexes provide another path to access data.
Teradata allows up to 32 secondary indexes per table.
Keep in mind; row distribution of records does not occur when secondary indexes are defined.
The value of secondary indexes is that they reside in a subtable and are stored on all AMPs, which is very different from how the primary indexes (part of base table) are stored.
Keep in mind that Secondary Indexes (when defined) do take up additional space. Secondary Indexes are frequently used in a WHERE clause.
The Secondary Index can be changed or dropped at any time.
However, because of the overhead for index maintenance, it is recommended that index values should not be frequently changed.
There are two different types of Secondary Indexes, Unique Secondary Index (USI), and Non-Unique Secondary Index (NUSI).
Unique Secondary Indexes are extremely efficient. A USI is considered a two-AMP operation. One AMP is utilized to access the USI subtable row (in the Secondary Index subtable) that references the actual data row, which resides on the second AMP.
A Non-Unique Secondary Index is an All-AMP operation and will usually require a spool file. Although a NUSI is an All-AMP operation, it is faster than a full table scan.

Secondary indexes can be useful for:

Satisfying complex conditions
Processing aggregates
Value comparisons
Matching character combinations
Joining tables

: Advantages:

A secondary index might be created and dropped dynamically
table may have up to 32 secondary indexes.
Secondary index can be created on any column. .Either Unique or Non-Unique
It is used as alternate path or Least frequently used cases. ex. defining SI on non indexed column can improve the performance, if it is used in join or filter condition of a given query.
Collecting Statistics on SI columns make sure Optimizer chooses SI if it is better than doing Full Table Scans

Disadvantages

Since Sub tables are to be created, there is always an overhead for additional spaces.
They require additional I/Os to maintain their sub tables.
The Optimizer may, or may not, use a NUSI, depending on its selectivity.
If the base table is Fallback, the secondary index sub table is Fallback as well.
If statistics are not collected accordingly, then the optimizer would go for Full Table Scan.
Secondary Indexes need more I/O.
Secondary Index slows down the process because of Sub-table maintenance.
Secondary Index requires more Physical Storage.

221.How are the data distributed in Secondary Index Subtables in Teradata?

When a user creates a Secondary Index, Teradata automatically creates a Secondary Index Subtable.
- The subtable will contain the:
- Secondary Index Value
- Secondary Index Row ID
- Primary Index Row ID
When a user writes an SQL query that has an SI in the WHERE clause, the Parsing Engine will Hash the Secondary Index Value. The output is the Row Hash, which points to a bucket in the Hash Map.
That bucket contains an AMP number and the Parsing Engine then knows which AMP contains the Secondary Index Subtable pertaining to the requested USI information. The PE will direct the chosen AMP to look-up the Row Hash in the Subtable.
The AMP will check to see if the Row Hash exists in the Subtable and double check the subtable row with the actual secondary index value.
Then, the AMP will pass the Primary Index Row ID back up the BYNET network. This request is directed to the AMP with the base table row, which is then easily retrieved.

222. What are the things to considered while creating secondary index?

Creating a secondary index causes Teradata to build a sub-table to contain its index rows, thus adding another set of rows that requires updating each time a table row is inserted, deleted, or updated.
Secondary index sub-tables are also duplicated whenever a table is defined with FALLBACK, so the maintenance overhead is effectively doubled.

223. What is primary index and secondary index?Answer:

Primary index is the mechanism for assigning a data row to an AMP and a location on the AMP’s disks.
Indexes also used to access rows from a table without having to search the entire table.
Secondary indexes enhance set selection by specifying access paths less frequently used than the primary index path.
Secondary indexes are also used to facilitate aggregate operations. If a secondary index covers a query, then the Optimizer determines that it would be less costly to accesses its rows directly rather than using it to access the base table rows it points to.

224. What is the purpose of indexes? Answer:

An index is a mechanism that can be used by the SQL query optimizer to make table access more performant.
Indexes enhance data access by providing a more-or-less direct path to stored data and avoiding the necessity to perform full table scans to locate the small number of rows you typically want to retrieve or update.

225.How does indexing improve query performance?

By creating index on single or multiple columns, query takes index access path instead of full table scan.
Full table scan is costilier operation where it has to visit all rows in the table whereas index operation just accesses the index sub table with required hash value and joins with the basetable using rowid. This improves response time and performance of the query.
Teradata automatically Creates the Primary index , if not specified by the DDl for Table .
- PI
- Secondary Index
- PPI
- Join Index
- HASH Index

Indexing is a way to physically reorganise the records to enable some frequently used queries to run faster. The index can be used as a pointer to the large table.
It helps to locate the required row quickly and then return it back to the user.
or The frequently used queries need not hit a large table for data. they can get what they want from the index itself. – cover queries.
Index comes with the overhead of maintanance.
Teradata maintains its index by itself. Each time an insert/update/delete is done on the table the indexes will also need to be updated and maintained.
Indexes cannot be accessed directly by users. Only the optimizer has access to the index.

226.What are the differences between the followings? – Vertical & Horizontal Partitioning vs – Join & Hash Indexes vs – PPI

Horizontal partitioning is used in large tables where no of rows are very high.
Rows are broken and placed in partitions.
Loading becomes faster and query performance would be good.
Vertical partitioning is used for tables with huge no of columns.
Column wise data are broken and placed in partitions with a common key.
A multi/aggregate join index basically stores the results of the join in perm space.
A hash index basically stores the results of the table based on the hash value.
PPI basically partitions the data based on the column that is specified in the partition clause.

227.What is Value-Ordered NUSIs ?

When we create a NUSI on the table a subtable is built on all AMP’s.
Each Subtable contains a row for the NUSI and corresponding row ids to the base rows located on the same AMP.
Rows in the subtable are sequenced by the row hash of the NUSI value.
This way of storing rows is convenient in storing rows with particular NUSI, but not good for range queries.
This is where Value Ordered NUSI becomes useful VALUE ordered NUSI allows rows in the subtable to be stored in order of data value , rather than hash of the value. This is useful when performing range queries.

228.Creating value ordered NUSI when creating table.

Example 1:
CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno) ORDER BY VALUES;

Doing a show table will show below:

SHOW TABLE EMPLOYEE1

CREATE SET TABLE EDW_RESTORE_TABLES.EMPLOYEE1 ,NO FALLBACK ,
   NO BEFORE JOURNAL,
   NO AFTER JOURNAL,
   CHECKSUM = DEFAULT,
   DEFAULT MERGEBLOCKRATIO
   (
   Employeeid INTEGER,
   DepartmentNo INTEGER,
   Salary DECIMAL(8,2),
   Hiredate DATE FORMAT 'YYYY/MM/DD')
PRIMARY INDEX ( Employeeid )
INDEX ( DepartmentNo ) ORDER BY VALUES ( DepartmentNo ); 
--> note that DEPARTMENTNO is used to order. 
Here since we had only one column in the index, it is used for ordering.

However when we have index made of multiple columns we can choose which column to use fo ordering.

Example 2:

CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno,Employeeid) ORDER BY VALUES;
/* Multiple columns as part of INDEX*/



SHOW TABLE EMPLOYEE1

CREATE SET TABLE EDW_RESTORE_TABLES.EMPLOYEE1 ,NO FALLBACK ,
   NO BEFORE JOURNAL,
   NO AFTER JOURNAL,
   CHECKSUM = DEFAULT,
   DEFAULT MERGEBLOCKRATIO
   (
   Employeeid INTEGER,
   DepartmentNo INTEGER,
   Salary DECIMAL(8,2),
   Hiredate DATE FORMAT 'YYYY/MM/DD')
PRIMARY INDEX ( Employeeid )
INDEX ( Employeeid ,DepartmentNo ) ORDER BY VALUES ( DepartmentNo ); 
--> by default system choose the 1st column DEpartmentNo for ordering.


In case of multiple columns being part of index 
we can specify the column which we would want to be used 
for ordering as shown in below example:

Example 3:

CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno,Employeeid) ORDER BY VALUES(employeeid);


Creating Value ordered NUSI using CREATE INDEX syntax

CREATE INDEX(DepartmentNO) ORDER BY VALUES (DepartmentNO) on EMPLOYEE ;


Rules for using value ordered NUSI

The ordering can be done only on 1 single column.

In below we try to create a value ordered NUSI with two columns

CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno,Employeeid) ORDER BY VALUES(Departmentno,Employeeid) ;

Following is the error message we get:

CREATE TABLE Failed. 5466: Error in Secondary Index DDL, Order by clause can have only one column. 

Column used for ordering must a part or all of the index definitions

In the following query we are ordering by a column that is not a part of index definition

CREATE TABLE EMPLOYEE1
(
Employeeid integer,
DepartmentNo integer,
Salary decimal(8,2),
Hiredate date
) INDEX (Departmentno,Employeeid) ORDER BY VALUES(salary) ;

Following is the error message we get:

CREATE TABLE Failed. 5466: Error in Secondary Index DDL,
 Order by field does not belong to the index. 


The column should be numeric- non numeric columns cannot be used.

Column should not be greater than 4 bytes. Thus only INT,
 SMALLINT, BYTEINT, DATE, DEC are valid. A Decimal is 
also allowed provided its storage length does not exceed 
4 bytes and it does not have any precision digits.

)

229.I have not defined PRIMARY INDEX or PRIMARY KEY on table what will happen now: In this case,

a) I have not defined PRIMARY INDEX or PRIMARY KEY on table what will happen now: In this case, Teradata will check if any column is defined as UNIQUE, then it will make that column as UNIQUE PRIMARY INDEX else first column will be created as PRIMARY INDEX.

b) I have not defined PRIMARY INDEX however a column is defined as PRIMARY KEY: In this case, Teradata will make the PRIMARY KEY column as UNIQUE PRIMARY INDEX of the table.

c) I have defined both PRIMARY KEY and PRIMARY INDEX on different column: In this case, Teradata will make PRIMARY KEY column as UNIQUE SECONDARY INDEX i.e. UNIQUE INDEX on the table. So one must understand the importance of PRIMARY INDEX in Teradata. Generally, PRIMARY KEY concept is taken care by UNIQUE PRIMARY INDEX in Teradata environment.

230.How Teradata can accesses data ?

Primary Index (fastest)
Secondary Index (second fastest way)
Full Table Scan (slowest way)

Primary Index (fastest) – When ever a Primary Index is utilized in the SQL WHERE Clause the PE will be able to use the Primary Index to get the data with a one- AMP operation.

Secondary Index (next fastest) -If the Primary Index is not utilized sometimes Teradata can utilize a secondary index.It is not as fast as the Primary Index, but it is much faster than a full table scan.

Full Table Scan (FTS) (Slowest) Teradata handles full table scans brilliantly because Teradata accesses eachdata row only once because of the parallel processing. Full Table Scans are a way to access Teradata without using an index.Each data block per table is read only once.

231.What are the reasons for product joins ?

1.Stale or no stats causing optimizer to use product join
2. Improper usage of aliases in the query.
3. missing where clause ( or Cartesian product join 1=1 )
4. non equality conditions like > ,< , between example ( date)
5. few join conditions
6. when or conditions are used. last but not the least product joins are not bad always!! sometimes PJ are better compared to other types of joins.

233.Create Table: When you are designing/creating the table, check the below steps.

Always create Multiset table.If you have a requirement to create a SET table then make sure that you have a unique primary index or unique secondary index on that table.
If there is no UPI or USI on the SET table then the new row will check every
row with same row hash value byte-by-byte to make sure that a duplicate doesnt insert into the SET table.
Choose the primary index that has even distribution and high join access. If the column has unique values and is not a primary index then create Unique Secondary Index. But Fastload and Multiload will not work if the table has Unique Secondary index.
Do not create any UNICODE columns unless absolutely required. UNICODE requires double the storage space and stats often become unusable for Teradata.
If the table has no updates then you can change default Free Space Percent Parameter of a table to 0 so that the data block can hold many rows and then all rows can be retrieved in a single I/O.
Try to create vertical partition if the table has too many columns and some of the columns are used very rare.If you reduce the table size then the data block can hold many rows so I/O can be more efficient.
If the 90% of users queries data for current year then horizontal partition gives best performance.
You can split the table by creating two tables history and current.
Another approach is by creating partition. Teradata partition reduces the overhead of scanning the complete table thus improving performance. Choose the partition table when is accessed by date ranges or character columns. Make sure that number of partitions is less.
Compression reduces the storage capacity and the number of I/Os required for an operation. Wherever possible, do the compression.
If there is a join column in a table and has more null values then replace the null values with some random number using random function and then multiply with -1 then those records can be easily distinguished from other records. Better to give random range like 1 to 10000 so it will not impact distribution if there is an upgrade in near future. Business users dont want to see negative or unknown members in the reports so replace negative values with null in an application layer.
Try to define all join columns with not null at table level even though source is providing null values. ETL logic should populate it with some random values if they are null so there will be no data skew issues or spool out error in applications.
Wherever possible try to avoid outer joins. We can avoid outer joins by inserting some unknown (dummy) records into the dimension (master) table. Same range values will be inserted into the FACT (Transaction) table wherever there is a null in corresponding column.
Note down that there is a difference between create table (copy) and create table (select). Create table (copy) retains all definitions and indexes where as create table (select) not.