**collect Statistics**

**185.What is collect State in Teradata ? What it use and how it works?**

- This statistics is used by the PE to optimize the plan which can be viewed by explain command.
- When collect stats is executed,the statistics is stored for use of PE.
- The Optimizer plans an execution strategy for every SQL query submitted to it.
- For the Optimizer to consistently choose the optimum strategy, it must be provided with reliable, complete, and current demographic information regarding all of these factors. The best way to assure that the Optimizer has all the information it needs to generate optimum execution strategies is to COLLECT STATISTICS.

- Statistics tell the Optimizer how many rows/ value there are.
- May improve performance of complex queries and joins.
- NUSI Bit Mapping requires collected statistics.
- Helpful in accessing a column or index with uneven value distribution.
- Stale statistics may mislead the Optimizer into poor decisions.
- Statistics remain valid across a reconfiguration of the system.
- COLLECT is resource intensive and should be done during off hours.

**186.When should the statistics be collected?**

Here are some excellent guidelines on when to collect statistics:

- All non unique indices
- Non index join columns
- Primary index of small tables
- Secondary indexes mentioned in the tables
- Primary Index of a Join Index
- Secondary Indices defined on any join index
- Join index columns that frequently appear on any additional join index columns that frequently appear in WHERE search conditions
- Columns that frequently appear in WHERE search conditions or in the WHERE clause of joins.

**187.How to make sure Optimizer chooses NUSI over Full Table Scan (FTS)?**

- A optimizer would prefer FTS over NUSI, when there are no Statistics defined on NUSI columns.
- It would prefer scanning for full table instead of going for Subtables in NUSI since optimizer does not have information about subtables of NUSI
- It is always suggested to collect statistics whenever NUSI columns are defined on the table. Verify whether index is being used by checking in Explain plan.

**188.what is collected in Collect Statistics**

The Teradata Parsing Engine (PE) is in charge of creating the PLAN for the AMPs to follow. The PE works best when Statistics have been collected on a table. Then it knows:

- The number of rows in the table
- The average row size Information on all Indexes in which statistics were collected
- The range of values for the column(s) in which statistics were collected
- The number of rows per value for the column(s) in which statistics were collected
- The number of NULLs for the column(s) in which statistics were collected

**189.Purpose of collect statistics**

- The optimizer uses this synopsis data to generate efficient table access and join plans
- When statistics are collected, Teradata does a full table scan, sorts the column or index
- You don’t COLLECT STATISTICS on all columns and indexes because it takes up too much space for unnecessary reasons,
- If you run an Explain on a query and the row estimate has No Confidence or Low Confidence, then that is a sign that no statistics were collected.
- The first time you collect statistics, you collect them at the index or column level. After that you just collect statistics at the table level and all previous columns collected previously are collected again. It is a mistake to collect statistics only once and then never do it again. COLLECT STATISTICS each time a table’s data changes by 10%

**190.What does collect statistics do to help the PE come up with a better plan?•**

Access Path – The PE will easily choose and use any Primary Index access (UPI or NUPI) and it will also easily choose a Unique Secondary Index (USI), but statistics really help the PE decide whether or not to do a Full Table Scan or use a Non-Unique Secondary Index (NUSI) or if it can use multiple NUSI’ s AND ed together to perform a NUSI bitmap.•

Join Method – When you collect statistics it gives Teradata a better idea whether or not to do a merge join, product join, hash join or nested join.•

Join Geography – When two rows are joined together they must physically be located on the same AMP. The only way that this happens naturally is if the join column (PK/FK) is the Primary Index of both tables. Most of the time this is not the case and Teradata must decide the Join Geography of how it will relocate the rows to co-locate them on the same AMP. Will it redistribute (rehash by the join column) one or both of the tables, or will it duplicate the smaller table across all AMPs? A redistribution or duplication are the paths to co-location.•

Join Order – All joins are performed two tables at a time. What will be the best order to join the tables together? When two or more tables are involved this becomes very important..

It is the** access path, **the **join method**, the **join geography** and the **order** that makes statistics collection so vital to all Teradata systems.

**191.How do you know if Statistics were collected on a Table?**

Syntax: HELP Statistics <Table Name>

**192.Where Does Teradata Keep the Collected Statistics?**

spread evenly across all AMPs in three tables:

- DBC.Indexes(for multi-column indexes only)
- DBC.TVFields(for all columns and single column indexes)
- DBC.StatsTbl(Teradata V14 and beyond)

**193.How to Recollect STATISTICS on a Table**

Here is the syntax for re-collecting statistics on a table:

COLLECT STATISTICS ON <Tablename> ;

The first time you collect statistics, you do it for each individual column or index that you want to collect on. When a table changes its data by 10% due to Inserts, Updates, or Deletes you merely use the command above, and it re-collects on the same columns and indexes previously collected on.

**194.what is Random AMP Sample**

Teradata Always Does a Random AMP Sample

The Parsing Engine will hash the Table ID for a table being queried and then use the Hash Map to determine which AMP will be assigned to do a Random AMP Sample for this table.

Remember that a Random AMP sample only applies to indexed columns and table row counts.

In the old days, Teradata never did a Random AMP Sample unless statistics were not collected, but these days Teradata always does a Random AMP Sample before placing the Table Header inside each AMP’s FSG Cache.

This allows Teradata to compare these statistics with collected statistics to determine if statistics are old and stale.

If the statistics are determined to be out of date then the Random AMP Sample is used.Random Sample is Kept in the Table Header in FSG Cache

Teradata compares the collected statistics to a Random AMP Sample (obtained by sampling a single AMP before placing the Table Header in FSG Cache). This compare determines if the statistics will be used or if they should be replaced by the sample.

**195.Multiple Random AMP Samplings**

The PE does Random AMP Sampling based on the Table ID.

The Table ID is hashed, and that AMP is always selected as the sampled AMP for that table. This assures that no single AMP will be tasked for too many tables, but if the table is badly skewed this can confuse the PE.

So now, more than one AMP can be sampled when generating row counts for a query plan for much better estimations on **row count**, **row size**, and r**ows per value estimates per table.**

In the DBS Control area, field 65 can now set the standard for how AMPs are sampled.

65. RandomAmpSampling – this field determines the number of AMPs to be sampled for getting the row estimates of a table.

The valid values are D, L, M, N or A.

D – The default is one AMP sampling (D is the default unless changed.)

L – Maximum of two AMPs sampling

M – Maximum of five AMPs sampling

N – Node Level sampling (all the AMPs in a node are sampled).

A – System Level sampling (all the AMPs in the system are sampled).

Multiple AMPs can now be used for the random AMP sample so a higher number of AMPs sampled will provide better estimates to counter skewed results. But, it can cause short running queries to run slower just so long running queries can run faster.

**Random AMP Estimates for NUSI Secondary Indexes**

The Random AMP reads a couple of cylinders and some data NUSI blocks and then does some simple math to estimate the Rows Per NUSI Value. The PE then knows how strong or weak the WHERE clause is using the NUSI and if it should even use the NUSI. This is the most important decision for the Parsing Engine. Should it just do a Full Table Scan or use the NUSI? That is the biggest reason the PE needs statistics. That is why you should always collect statistics on all NUSI indexes.

You really only need to collect statistics on a Unique Secondary Index column if there are a lot of SQL statements on non-equality conditions such as range queries.

There’s No Random AMP Estimate for Non-Indexed Columns

Teradata does not do a Random AMP Sample for non-indexed columns that are used in the WHERE clause of the SQL

Today’s Teradata systems always perform a random AMP sample even if tables have statistics. Then, they compare the random AMP sample with the statistics to determine if the statistics are stale.

A random AMP is selected for a random sample. Two things happen:

1) Indexes are sampled on the random AMP, and the PE estimates based on the total number of AMPs in the system.

2) If a column in the WHERE clause of the SQL is not an Index, the PE assumes that 10% of the rows will come back. If two columns are in the WHERE clause, then it assumes 7.5% of the rows will come back. If three columns are in the WHERE Clause, it assumes 5%.

**196.When to COLLECT STATISTICS Using only a SAMPLE**

You might consider Collecting Statistics with SAMPLE if:

You are collecting statistics on a very large table. | |

When collecting statistics becomes a problem with system performance or cost because the system is so busy. |

Don’t consider Collecting Statistics with SAMPLE if:

The tables are small. | |

To replace all existing full scan Collect Statistics. | |

If the column’s data is skewed badly. |

COLLECT STATISTICS can be very time consuming because it performs a full table scan and then performs a lot of statistical calculations. Because Collect Statistics runs infrequently and benefits query optimization, it is considered a necessary task. Without statistics, query performance will suffer. The bad news about sampled statistics is that they may not be as accurate, which could negatively affect the PE’s plans. In most cases, sampled statistics are better than no statistics. Don’t use Sample unless necessary!

*Examples of COLLECT STATISTICS Using only a SAMPLE*

Sampled statistics are generally more accurate for data that is not skewed. For example, columns or indexes that are unique or nearly unique are not skewed. Because the PE needs to be aware of skewed data, you should not collect with sample on skewed data. That is why sampling is generally more appropriate for indexes than non-indexed column(s). If you recollect statistics on a Sample, it recollects with the same Sample!

**197.How to Collect Statistics on a PPI Table on the Partition**

Here is the syntax for collecting statistics on a PPI table.

COLLECT STATISTICS on <Tablename> COLUMN PARTITION;

Here is an actual example COLLECT STATISTICS on Order_Table_PPI COLUMN PARTITION;

Three reasons to Collect on the Partition:

The Parsing Engine will have a better plan for PPI Tables.

This helps the most with Partition Elimination on Range Queries.

This is especially helpful when a table has a lot of empty partitions.

The Parsing Engine can use this information to better estimate the query cost when there are a significant number of empty partitions. If PARTITION statistics are not collected, empty partitions may cause the Parsing Engine to underestimate the number of rows in a partition. You shouldn’t use WITH SAMPLE to collect on Partitions.

**198.Teradata V14 Statistics Enhancements• **

There is now a SUMMARY option to collect table-level statistics.•

SYSTEM SAMPLE option allows the system to determine the sampled system percentage.• Sampling options have been enhanced (e.g., SAMPLE n PERCENT).•

Statistics are stored in DBC.StatsTbl to reduce access contention and improve performance.•

New views DBC.StatsV, DBC.ColumnStatsV, DBC.MultiColumnStatsV, and IndexStatsV.• SHOW STATISTICS statement reports detailed statistics in plain text or XML formatting.• Internal PE enhancements for histogram structure and use, including:•

Storing statistics data in their native Teradata data types without losing precision• Enhanced extrapolation methods for stale statistics•

Maintaining statistics history