HELPFUL DBA SQL
1. How do you define Teradata? Give some of the primary characteristics of the same.
Teradata is basically an RDMS which is used to drive the Datamart, Data warehouse, OLAP, OLTP, as well as DSS Appliances of the company. Some of the primary characteristics of Teradata are given below.
- Is capable of running on Single-nodes, as well as multi-nodes.
- Parallelism is built into the system.
- Very much compatible with the standards of ANSI.
- Tends to act in the same way as a server.
- It is an Open System that basically executes for UNIX MR-RAS, Suse Linux ETC, WIN2K, etc.
2. What are the newly developed features of Teradata?
Some of the newly developed features of Teradata are: –
- HASH JOIN ENHANCEMENTS
These enhancements extend the application of hash joins to include:*
- Classical and dynamic hash outer joins
- Inclusion hash semi joins and exclusion hash semi joins
- Dynamic, inclusion, and exclusion hash semi joins with dynamic partition elimination
- Hash joins with cross terms
- More efficient hash joins.
- Enhanced performance of outer joins, inclusion and exclusion semi joins, joins with partition elimination, and joins with cross terms.
Hash join enhancements with dynamic partition elimination only apply to dynamic hash
joins, inclusion hash joins, and exclusion hash joins. They do not apply to classical or
direct hash joins.
- Teradata QueryGrid:
Teradata database now able to connect Hadoop with this QueryGrid so it’s called as Teradata Database-to-Hadoop also referred as Teradata-to-Hadoop connector
- Teradata also provides a new data type which stores and process the XML data.
- DBQL Show Parameter parameterized query used to place parameters, and the parameter values are provided in a separate statement at time of execution.The main purpose is to distinguishes between code and data. Also avoids attackers from changing the query by inserting SQL commands.Two new Data Dictionary tables are
- DBC.DBQLParamTbl logs
- The Light-Weight Redistribution(LWR) also referred to as the Single Sender Redistribution (SSR). With this feature, the Teradata Database optimizer can switch from an all-AMP row redistribution to a few AMP row redistribution.
Big Data & Analytic JSON Integration Scripting and Language Support Table Operator Enhancements 3D Geospatial Data Type Geospatial Performance Enhancements SQL-H Enhancements Performance Light-Weight Redistribution Software Efficiency Improvements Ecosystem SQL Interface For ShowBlocks TASM/TIWM – Classification by Datablock Selectivity Utility Statistics Logging TASM – Global Weights Unity Callback Support TASM – Virtual Partitions Enhancement TIWM – Operating Periods For Appliance Quality Smart Retry During Parsing HSN Health New PI On Access Rights Table Unresponsive Node Isolation Onsite System and Dump Analysis DBQL – Show Parameters Industry Compatibility Sequenced Aggregate Enhancements and Aggregate JI for Temporal 1MB Phase 2 ANSI Temporal Syntax Support Teradata Directory Manager
3. Highlight a few of the important components of Teradata.
- Access Module Processor (AMP)
- Parsing Engine (PE)
- Virtual Disk (vDisk)
- Virtual Storage System (VSS)
4. What is meant by a Virtual Disk?
- Virtual Disk is basically a compilation of a whole array of cylinders which are physical disks. It is sometimes referred to as disk Array.
- A vdisk is the logical disk space that is managed by an AMP. Depending on the configuration, a vdisk may not be contained on the node; however, it is managed by an AMP, which is always a part of the node.
- The vdisk is made up of 1 to 64 pdisks (user slices in UNIX or partitions in Windows NT, whose size and configuration vary based on RAID level). The pdisks logically combine to comprise the AMP’s vdisk. Although an AMP can manage up to 64 pdisks, it controls only one vdisk. An AMP manages only its own vdisk, not the vdisk of any other AMP.
5. Why AMP & PE are called Vprocs?
AMPs and PEs are implemented as virtual processors – vprocs. They run under the control of PDE and their number is software configurable. AMPs are associated with virtual disks vdisks which are associated with logical units (LUNs) within a disk array
Vprocs:Virtual process From PE to AMP (This is the network root via MSP(message passing layer),The processing data will store in Disks(These are Physical disks),Each Amp have too many P.disks,to migrate these P.disks The Bynet Network maintains Virtual disks.These V.disks will responsible for data migration.hence they are called as Virtual Process(VPROCS).
Both AMP and PE are implemented using software and thus the word Virtual processor. There is no special hardware (like a special microprocessor) implementation required AMP and PE to function.
The versatility of the Teradata Database is based on virtual processors (vprocs)that eliminate dependency on specialized physical processors. Vprocs are a set of software processes that run on a node under the Teradata Parallel Database Extensions (PDE) within the multitasking environment of the operating system.
6. Explain the meaning of Amp?
Amp basically stands for Access Module Processor and happens to be a processor working virtually and is basically used for managing a single portion of the database. This particular portion of database cannot be shared by any other Amp. Thus, this form of architecture is commonly referred to as shared-nothing architecture.
7. What does Amp contain and what are all the operations that it performs?
Amp basically consists of a Database Manager Subsystem
- Performing DML
- Performing DDL
- Implementing Aggregations and Joins.
- Releasing and applying locks, etc.
8. What is meant by a Parsing Engine?
- PE happens to be a kind Vproc.
- Its primary function is to take SQL requests and deliver responses in SQL.
- It consists of a wide array of software components that are used to break SQL into various steps and then send those steps to AMPs.
- To put it simply, a virtual processor is a simulated processor in a processing software system, or a software version of a dedicated physical processor. Each vproc uses a portion of the physical processor resource, and runs independently of other vprocs. The vprocs co-existing within a node share a single memory pool – a collection of free memory in the node. The portion of memory allocated from the pool to one vproc will be returned to the pool after usage for use by other vprocs.
- The virtual processor (VPROC) is the basic unit of parallelism. One symmetrical multi-processing (SMP) node is consisted of several VPROCs, as many as 8 or 12.
9.What do you mean by parsing?
Parsing is a process concerned with analysis of symbols of string that are either in computer language or in natural language.
10. What are the functions of a Parser?
- Checks semantics errors
- Checks syntactical errors
- Checks object existence
11. What is meant by a dispatcher?
Dispatcher takes a whole collection of requests and then keeps them stored in a queue. The same queue is being kept throughout the process in order to deliver multiple sets of responses.
12. How many sessions of MAX is PE capable of handling at a particular time?
PE can handle a total of 120 sessions at a particular point of time.
13. Explain BYNET.
BYNET basically serves as a medium of communication between the components. It is primarily responsible for sending messages and also responsible for performing merging, as well as sorting operations.
14. Explain PDE.
PDE basically stands for Parallel Data Extension. PDE basically happens to be an interface layer of software present above the operation system and gives the database a chance to operate in a parallel milieu.
15. What is TPD?
TPD basically stands for Trusted Parallel Database, and it basically works under PDE. Teradata happens to be a database that primarily works under PDE. This is the reason why Teradata is usually referred to as Trusted Parallel or Pure Parallel database.
16. What is meant by a Channel Driver?
A channel driver is software that acts as a medium of communication between PEs and all the applications that are running on channels which are attached to the clients.
17. What is meant by Teradata Gateway?
Just like channel driver, Teradata Gateway acts as a medium of communication between the Parse Engine and applications that are attached to network clients.
Only one Gateway is assigned per node.
18. Highlight a few of the advantages that ETL tools have over TD.
- Multiple heterogeneous destinations, as well as sources can be operated.
- Debugging process is much easier with the help of ETL tools owing to full-fledged GUI support.
- Components of ETL tools can be easily reused, and as a result, if there is an update to the main server, then all the corresponding applications connected to the server are updated automatically.
- De-pivoting and pivoting can be easily done using ETL tools.
19. Mention a few of the ETL tools that come under Teradata.
Some of the ETL tools which are commonly used in Teradata are DataStage, Informatica, SSIS, etc.
20. What is the meaning of Caching in Teradata?
Caching is considered as an added advantage of using Teradata as it primarily works with the source which stays in the same order i.e. does not change on a frequent basis. At times, Cache is usually shared amongst applications.
21. What is meant by a node?
A node basically is termed as an assortment of components of hardware and software. Usually a server is referred to as a node.
22. What is meant by a Clique?
A Clique is basically known to be an assortment of nodes that is being shared amongst common disk drives. Presence of Clique is immensely important since it helps in avoiding node failures.
23. What happens when a node suffers a downfall?
Whenever there is a downfall in the performance level of a node, all the corresponding Vprocs immediately migrate to a new node from the fail node in order to get all the data back from common drives.
24. Highlight the points of differences between the database and user in Teradata.
- A database is basically passive, whereas a user is active.
- A database primarily stores all the objects of database, whereas a user can store any object whether that is a macro, table, view, etc.
- Database does not has password while the user has to enter password.
25. Explain how spool space is used.
Spool space in Teradata is basically used for running queries. Out of the total space that is available in Teradata, 20% of the space is basically allocated to spool space.
26.How the data is distributed among AMPs based on PI in Teradata?
- Assume a row is to be inserted into a Teradata table
- The Primary Index Value for the Row is put into the Hash Algorithm
- The output is a 32-bit Row Hash
- The Row Hash points to a bucket in the Hash Map.The first 16 bits of the Row Hash of is used to locate a bucket in the Hash Map
- The bucket points to a specific AMP
- The row along with the Row Hash are delivered to that AMP
When the AMP receives a row it will place the row into the proper table, and the AMP checks if it has any other rows in the table with the same row hash. If this is the first row with this particular row hash the AMP will assign a 32-bit uniqueness value of 1. If this is the second row hash with that particular row hash, the AMP will assign a uniqueness value of 2. The 32-bit row hash and the 32-bit uniqueness value make up the 64-bit Row ID. The Row ID is how tables are sorted on an AMP. This uniqueness value is useful in case of NUPI’s to distinguish each NUPI value. Both UPI and NUPI is always a One AMP operation as the same values will be stores in same AMP.
27. How Teradata retrieves a row?
For example, a user runs a query looking for information on Employee ID 100. The PE sees that the Primary Index Value EMP is used in the SQL WHERE clause. Because this is a Primary Index access operation, the PE knows this is a one AMP operation. The PE hashes 100 and the Row Hash points to a bucket in the Hash Map that represents AMP X. AMP X is sent a message to get the Row Hash and make sure it’s EMP 100.
28.NO PI in Teradata?
CREATE TABLE <TABLE> (
PK INTEGER NOT NULL
) NO PRIMARY INDEX ;
NOPI tables come with some further restrictions, find below the most important ones:
– Only MULTISET tables can be created
– No identity columns can be used
– NoPI tables cannot be partitioned with a PPI
– No statements with an update character allowed (UPDATE,MERGE INTO,UPSERT), still you can use INSERT,DELETE and SELECT
– No Permanent Journal possible
– Cannot be defined as Queue Tables
– No Queue Tables allowed
– Update Triggers cannot update a NOPI table (probably introduces with a later release)
– No hash indexes are allowed (use join indexes instead)
The following features can be used as usual together with NOPI tables:
– Fallback protection of the table
– Secondary Indexes (USI, NUSI)
– Join Indexes
– CHECK and UNIQUE constraints
– Collection of statistics
- The information about NOPI tables, as in the case of any table type, is stored in DBC.Tables. The tablekind is ‘O’:
- (Fastloads, TPUMP array insert loads) is faster than loading tables with a Primary Index.
- After distributing the rows randomly across the AMPs we are already finished. No hashing and redistribution is needed. No sorting is needed. Furthermore, as the rows are assigned randomly to the AMPs, our data will always be distributed evenly across all AMPs and no skewing will occur.only the acquisition phase of the loading utilities is executed
- rows are always appended to the end of the table’s data blocks, thus any overhead usually caused by sorting the rows by rowhash into the data blocks is avoided.
- Bulk loaded small tables will always be skewed. Round robin is done on block level and not the row level. Some AMPs will receive data blocks and the rest will receive no data.
- If you execute an INSERT…SELECT statement, from a primary index table into a NOPI table, AMP-local copying of the rows will be applied. This means, all rows stay on their current AMP, and if the primary index table is skewed you, will end up with a skewed NOPI table.
- Skewing can happen, by executing an archive job on a source system, and a restore of the backup on a different target system.
- NOPI tables are useful in certain situations, but without a Primary Index, row access is limited to All-AMP full tables scans
- NOPI tables are useful in certain situations, but without a Primary Index, row access is limited to All-AMP full tables scans. In order to offset this impact, you could add an unique secondary index (USI) to your NOPI table.mainly consider using them during the ETL-Process, in situations when anyway a full table scan is required.
29. SubQuery and Correlated Subquery in teradata?
subquery is that it retrieves a list of values that are used for comparison against one or more columns in the main query. Here the subquery is executed first and based on the result set, the main query will be executed.
Select empname,deptname from employee where empid IN ( select empid from salarytable where salary>10000).
In the above query, empid will be chosen first based on the salary in the subquery and main query will be executed based on the result subset.
Correlated Subquery is an excellent technique to use when there is a need to determine which rows to SELECT based on one or more values from another table.It combines subquery processing and Join processing into a single request.
It first reads a row from the main query and then goes into the subquery to find the rows that match the specified column value.Then it goes for the next row from the main query. This process continues until all the qualifying rows from MAIN query.
select empname,deptno, salary from employeetable as emp
where salary=(select max(salary) from employeetable as emt where emt.deptno=emp.deptno)
Above query returns the highest paid employee from each department. This is also one of the scenario based questions in teradata.
30.Which is more efficient GROUP BY or DISTINCT to find duplicates in Teradata?
- With more duplicates GROUP BY is more efficient
- while if we have fewer duplicates the DISTINCT is efficient.
- It depends on the tablesize. If the records are more than half a million then GROUP BY is much better than DISTINCT.
- However if the records are very less than DISTINCT performance is better than GROUP BY.
31. What is spool space and when running a job if it reached the maximum spool space how you solve the problem in Teradata?
- WHERE clause is missing, it will cause a special product join, called Cartesian Join or Cross Join, Spool space is the space which is required by the query for processing or to hold the rows in the answer set.
- Spool space reaches maximum when the query is not properly optimized.
- We must use appropriate condition in WHERE clause and JOIN on correct columns to optimize the query.
- Also make sure unnecessary volatile tables are dropped as it occupies spool space.
32.What is Vdisk and how it will communicate with physical data storage at the time of data retrieval through AMP ?
- The total disk space associated with an AMP is called a vdisk. A vdisk may have up to three ranks. Hence Vdisk will communicate with physical storage through array controllers.
- Each AMP vproc must have access to an array controller, which in turn accesses the physical disks.
- AMP vprocs are associated with one or more ranks (or mirrored pairs) of data
33: After creating tables dynamically in the Teardata, where is the GRANT table option usually done ? When tables are newly created, what is the default role and what the default privileges which get assigned ?
- The GRANT option for any particular table depends on the privileges of the user. If it is an admin user you can grant privileges at any point of time.
- The default roles associated with the newly created tables depend on he schema in which they are created.
34: what is basic teradata query language?
- BTEQ(Basic teradata query) It allows us to write SQL statements along with BTEQ commands.
- We can use BTEQ for importing,exporting and reporting purposes.
- The commands start with a (.) dot and can be terminated by using (;), it is not mandatory to use (;).
- SQL statements doesn’t start with a dot , but (;) is compulsory to terminate the SQL statement.
- BTEQ will assume any thing written with out a dot as a SQL statement and requires a (;) to terminate it..
35. Difference between BTEQ and Sql assistant (query man)?
BTEQ : Basic Teradata Query utility
- SQL front-end : Report writing and formatting features
- Interactive and batch queries
- Import/Export across all platforms
- The default number of sessions, upon login, is 1.
Teradata Query Manager / Queryman / Teradata SQL Assistant
- SQL front-end for ODBC compliant databases
- Historical record of queries including:
- Row counts
- Random sampling feature
- Limit amount of data returned
- Import/Export between database and PC
- Export to EXCEL or ACCESS
36.What is explain in teradata?
- The EXPLAIN modifier preceding an SQL statement generates an English translation of the optimizer’s plan. It is fully parsed and optimized, but not executed.
- EXPLAIN returns: Text showing how a statement will be processed (a plan).
- An estimate of how many rows will be involved.
- A relative cost of the request (in units of time).
- This information is useful for:
- Predicting row counts.
- Predicting performance.
- Testing queries before production.
- Analyzing various approaches to a problem
- EXPLAIN may also be accessed using Teradata Visual Explain.
- The execution strategy provides what an optimizer does but not why it chose them.
- The EXPLAIN facility is used to analyze all joins and complex queries.
37.What is log table? What is the use of log table?
- Data conversion errors,
- Constraint violations
- and other error conditions:
- Contains rows which failed to be manipulated due to constraint violations or Translation error
- It logs errors & exceptions that occurs during the apply phase.
- It logs errors that are occurs during the acquisition phase.
- Log table in MLOAD is used to store the results of each phase of the operation. This log table is used by MLOAD for restarting the mload process from the appropriate checkpoint i.e. after the successful phase.
- Captures rows that contains duplicate Values for UPIs.
39.How Terdata Row Distribution happen
- Each AMP holds a portion of every table.
- Each AMP keeps their tables in separate drawers.
- Each table is sorted by Row ID.
40.Why should we go Teradata over Oracle? What is the main differences?
- Teradata Architecture is shared nothing.It supports unconditional parallelism.
- Data : supports terabytes of data.
- Load : we can load bulk data into tables with less time.
- Scale : Teradata is highly scalable
41.Explain parallel distribution and subtable concept in teradata?Parallel distribution:
Teradata stores the data uniformly. Ideally, every amp should have equal no of rows from each table in vdisks. Based on the primary index, the distribution of data would be uniform on amps.
Sub table: This table is created when a secondary Index is created on the table, it will have rowid,data and hash value for the data. It is used to fetch the data using secondary index.
42.What is activity count? What is skew factor?
- Activity count indicates the no of rows affected/processed by the last request(no of rows returned to bteq from teradata)
- skewness refers to the distribution of rows on the amps. If some AMPs are having more rows and some very less, then skewness would be high. This would affect the parallelism.
- Major are 3 type of skews (CPU skew,AMP/Data skew, IO Skew). –
- Data skew? When data is not distributed equally on all AMPs. –
- CPU skew? Who is taking/consuming more CPU called CPU skew. –
- IO skew? Who perform more IO Operation. called IO Skew.
44.Increasing number of amps will increase performance. Is this true?
Yes,increase the no of amps with same no of users–performance increases proportionately.
Increase the no of PEs to increase the no of sessions.
Increase the no of disks to increase the data storage.
45.How to copy 1 million records from one DB to another, Which is effective in terms of performance?
1) In case of data migration from one db to another db in two different environments, nparc utility can be used
2) TPT utilities can be be used transfer data from one environment other environment
46.The following table types exists in Teradata:
- Permanent Table – Once created, remains in the database until it is dropped. It can be shared with different sessions and users.
- 1. SET table – Strictly NO duplicate values [By-Default]
- 2. MULTISET table – Allows Duplicate Values.
- Temporary Tables:
- Volatile Table – Only the session in which it is created can access the table structure and data. Session ends, table gets dropped.
- Global Temporary Table – Once created, the structure can be shared with different sessions and users but the data remains private to the session who loaded it. Session ends, its private data gets deleted.
- Derived Table – A type of temporary table derived from one or more other tables as the result of a sub-query.
- Queue Table – Permanent tables with timestamp column which indicates when each row was inserted.
- Error Logging Table – Stores errors of an associated permanent table.
- NoPI Table – Permanent tables that do not have any primary index defined on them
47.What is difference between Error code and Error Level in Bteq ?
Errorcode is used in bteq conditional logic. It holds the return code of every sql in bteq. 0 indicates perfect execution of the sql. The code returned by the previous sql statement which resembles the error message information(we can see in DBC.ErrorMsgs table). Return Code: The code return by the script after execution. These return codes can be 0,2,4,8,12,16.
Errorlevel assigns severity to error numbers. To know the error codes type select * from dbc.errormsgs in the Queryman Error level is the severity set for the corresponding error code if error level is 0- Success 4- Warning 8- User Error 12- Fatal error 16- Typical Error
48. What is the difference between Access Logging and Query Logging in Teradata?
- Access Logging is concerned with security (i.e. who’s is doing what). In access logging you ask the database to log who’s doing what on a given object. The information stored is based on the object not the SQL fired or the user who fired it.
- Query Logging (DBQL) is used for debugging (i.e. what’s happening around ?). Incase of DBQL database keep on tracking various parameters i.e. the SQLs, Resource, Spool Usage, Steps and other things which help you understand
49.Commands used within Ferret Utility?
- SHOWSPACE –
- amount of Disk Cylinder Space is in use
- amount of Disk Cylinder Space is available in the system.
- This will give you an information about
- Permanent Space cylinders,
- Spool Space cylinders,
- Temporary Space cylinders,
- Journaling cylinders,
- Bad cylinders and
- Free cylinders.
- For each of these 5 things it will present you 3 parameters i.e.
- Average Utilization per cylinder,
- % of total available cylinders and
- number of cylinders.
50.Explain Ferret Utility in Teradata?
- Ferret (File Reconfiguration tool) is an utility which is used to display and set Disk Space Utilization parameters within Teradata RDBMS. When you select the Ferret Utility parameters, it dynamically re configures the data on disks.
- We can run this utility through Teradata Manager; to start the Ferret Utility type (START FERRET) in the database window.
51. What does SLEEP function does in Fast load?
- The SLEEP command specifies the amount minutes to wait before retrying to logon and establish all sessions. Sleep command can be used with all load utilities not only fastload. This situation can occur if all of the loader slots are used or if the number of requested sessions is not available.
- The default value is 6 minutes. If tenacity was set to 2 hours and sleep 10 minutes, Sleep command will try to logon for every 10 minutes up to 2 hours duration.
52. What is multi Insert?
- Inserting data records into the table using multiple insert statements. Putting a Semi colon in front of the key word INSERT in the next statement rather than Terminating the first statement with a semi colon achieves it.
- Insert into Sales “select * from customer” ; Insert into Loan “select * from customer”;
53. Which one is better IN or BETWEEN?
If we have to search for range of values, it is always advisable to use BETWEEN rather than list of values in IN clause. BETWEEN tells the optimizer that it is range of values we are going to handle rather than individual values. Hence, BETWEEN is much faster than IN.
54.What is default session transaction mode in Teradata?
- Teradata has two session transaction mode : BTET(Default) and ANSI.
- To change session mode we use following command before establishing session: .SET SESSION TRANSACTION (ANSI/BTET)
55.What are the options that are not available for global temporary tables ?
GTT definitions are created in Data dictionary. These table provide separate instance to each user who refers to them . The following options are not available for global temporary tables:
- Any kind of constraints like check/referential cannot be applied on table –
- Identity columns since data in GTT are materialized only for session –
- Permanent Journaling cannot be done as data in tables are instances only to user in that session –
- PPI cannot be applied as data does not get stored in PERM , only TEMP space is utilized here.
56.What are the options not available for volatile tables in teradata ?
The following options are not available for volatile tables because table definition is not stored in data dictionary
- Default values for columns –
- Title clause for columns –
- Named Indexes for table –
- Compression on columns/table level since table data are spooled –
- Stats cannot be collected since data is materialized only for session –
- Identity columns as these again would need entry in IDcol tables –
- PPI cannot be done on tables – Any kind of constraints like check/referential cannot be applied on table
57.Explain about Skew Factor?
The data distribution of table among AMPs is called Skew Factor Generally For Non-Unique PI we get duplicate values so the more duplicate vales we get more the data have same rowhash so all the same data will come to same amp, it makes data distribution inequality, One amp will store more data and other amp stores less amount of data, when we are accessing full table, The amp which is having more data will take longer time and makes other amps waiting which leads processing wastage In this situation (unequal distribution of data)we get Skew Factor High For this type of tables we should avoid full table scans
58.What is the acceptable range for skew factor in a table?
There is no particular range for skew factor. In case of production systems, it is suggested to keep skew factor between 5-10.
There are various considerations for skew factor –
- Number of AMPS
- Size of row in a table
- number of records in a table
- PI of a table
- Frequent access of table (Performance consideration)
- whether table getting loaded daily /monthly or how frequently data is being refreshed
59. How do you determine the number of sessions?
- Teradata performance and workload
- Client platform type
- Channel performance for channel attached systems
- Network topology and performance for network attached systems
- Volume of data to be processed by the application
60.What is multivalued compression in Teradata?
Multi valued compression or just MVC is a compression technique applied on columns in Teradata . MVC has a unique feature of compressing up-to 255 distinct values per column in a given table. The advantage of compression are
- Reduced Storage cost by storing more of a logical data than physical data.
- Performance is greatly improves due to reduced retrieval of physical data for that column.
- Tables having compression always have an advantage since optimizer considers reduced I/O as one of the factors for generating EXPLAIN plan.
61. What is use of compress in terdata?Explain?
- Compression is used to Minimize the table size, for example when the table size is increasing anonymously We can use Compression to reduce the size of the table Conditions:
- Compression can be declared at the time of table creation
- We can compress up to 256 column values(not columns)
- We can’t compress variable length fields (vartext,varchar.
- The compressed value is stored in column header and is to be used as default of that column unless a value is present. e.g. Dept of Taxes in Washington has a database that tracks all people working in Washington. Around 99.9% of the tax payers would have Washington as their state code on their address. Instead of storing “Washington” in millions of records the compress will store the value “Washington” in the table header. Now, a value can be assumed in a row as a default unless another value exists inside the column.
62. Can you compress a column which is already present in table using ALTER in Teradata?
- No, We cannot use ALTER command to compress the existing columns in the table.
- A new table structure has to be created which includes the Compression values and data should be inserted into Compress column table.
- Please note – ALTER can be used only to add new columns with compression values to table.
63.how do you you implement Multi valued compression in an existing table?
- A new column with multi valued compression can be added to an existing table, but cannot modify existing compressed column.
- Create a new table with column which has MVC and do insert .. select from original table Or
- CREATE TABLE… as with column designated MVC.
64.What are advantages of compression on tables?
- They take less physical space then uncompressed columns hence reducing space cost
- They improve system performance as less data will be retrieved per row fetched ,
- more data is fetched per data block thus increasing data loading speed – They reduce overall I/O
65.What are the ways by which we can use zero to replace a null value for a given column ?
- Select Col1, ZEROIFNULL(Col2) from Table_name;
- Coalesce Select Col1,COALESCE(Col2,0) from Table_name;
- .Case operator select Case When Col2 IS NOT NULL Then Col2 Else 0 End from Table_name;
66.Can a macro be called inside a macro?
The main purpose of run a set of repeated sql queries. Macro supports only DML queries . Hence We cant call any-other macro or not even a procedure in a macro. One trick to have closest functionality of this is to copy all the sql queries from macro2 to macro1 and add parameters if it is necessary
67.How do you find the list of employees named “john” in an employee table without using Like operator??
This question seems tricky.. but yes there is another way by which we can find names/patters without using like operator. By using “BETWEEN” , we can find the list of employees named john… sel * from employee where name between ‘J’ and ‘K’; But at times usage of between is tricky, if there are other employees starting with J, those employees will also be listed by this query.
68.What are different types of Spaces available in Teradata ?
There are 3 types of Spaces available in teradata ,they are
- This is disk space used for storing user data rows in any tables located on the database.
- Both Users & databases can be given perm space.
- This Space is not pre-allocated , it is used up when the data rows are stored on disk.
- It is a temporary workspace which is used for processing Rows for given SQL statements.
- Spool space is assigned only to users . –
- Once the SQL processing is complete the spool is freed and given to some other query.
- Unused Perm space is automatically available for Spool .
. TEMP space
- It is allocated to any databases/users where Global temporary tables are created and data is stored in them.
- Unused perm space is available for TEMP space
69.What is hash collision ?
This occurs when there is same hash value generated for two different Primary Index Values. It is a rare occurance and Has been taken care in future versions of TD
70.What is RAID, What are the types of RAID?
Redundant Array of Inexpensive Disks (RAID) is a type of protection available in Teradata. RAID provides Data protection at the disk Drive level. It ensures data is available even when the disk drive had failed. Th\ere are around 6 levels of RAID ( RAID0 to RAID5) . Teradata supports Two levels of RAID protection RAID 1 – Mirrored copy of data in other disk RAID 5 – Parity bit (XOR) based Data protection on each disk array. One of the major overhead’s of RAID is Space consumption
71.What are Restrictions on Views in Teradata?
- An index cannot be Created on a view.
- It cannot contain an ORDER BY clause.
- All the derived columns and aggregate columns used in the view must have an AS clause (alias defined).
- A view cannot be used to UPDATE if it contains the following :
- Data from more than one table (JOIN VIEW)
- The same column twice
- Derived columns
- A DISTINCT clause
- A GROUP BY clause
72.list Built-in functions used in teradata ?
- SESSION: – Returns a number for the session for current user . •
- TIME: – this function provides the current time based on a 24-hour day • USER: – This one gives the user name of the current user.
- • ACCOUNT: – display’s your Teradata Account information •
- CURRENT_DATE: – Returns the current system date •
- CURRENT_TIME: – This function returns the current system time and current session ‘Time Zone’ displacement. •
- CURRENT_TIMESTAMP: – Returns the current system timestamp with TimeZone DATABASE: – It returns the name of the default database for the current user.
- DATE: – same as Current_DATE and is teradata built in .
73.What are the difference types of temporary tables in Teradata?
- Global Temporary tables (GTT) –
- When they are created, its definition goes into Data Dictionary.
- When materialized data goes in temp space.
- That’s why, data is active up to the session ends, and definition will remain there up-to its not dropped using Drop table statement. If dropped from some other session then its should be Drop table all;
- You can collect stats on GTT.
- Defined with the CREATE GLOBAL TEMPORARY TABLE sql
- Volatile Temporary tables (VTT) –
- Local to a session (deleted automatically when the session terminates)
- Table Definition is stored in System cache .
- Data is stored in spool space.
- That’s why; data and table definition both are active only up to session ends.
- No collect stats for VTT.If you are using volatile table,
- you can not put the default values on column level (while creating table)
- Created by the CREATE VOLATILE TABLE sql statement
- Derived tables
- Derived tables are local to an SQL query.
- Not included in the DBC data dictionary database, the definition is kept in cache.
- They are specified on a query level with an AS keyword in an sql statement
74. Why is BTET transaction processing overhead in Teradata.
- BTET makes all the queries running between BT and ET as single transaction . If any of query fails then the entire set of queries will not be committed
- BTET also has an overhead with locking , since it holds locks on tables till the ET is occured or all the queries have executed successfully
- DDL statements cannot be used everywhere in BTET processing , but these statements can be given towards the end of BTET transaction.
- Using large number of BTET caused transient Journal to grow and has tendancy for System Restart.
75.what are the different date formats available in Teradata system?
- The Teradata default format is: YY/MM/DD
- The ANSI display format is: YYYY-MM-DD
76.How do you set default date setting in BTEQ?
They have to be set after logging on to the session
set session dateform = ANSIDATE ; /*format is yyyy-mm-dd */
set session dateform = integerdate ; /* format is yy/mm/dd -teradata date format */
77.What does DROP table table_name command do?
- It deletes all data in table_name
- Removes the definition from the data dictionary
- Removes all explicit access rights on the table
78.Is Like comparison case-sensitive in Teradata?
LIKE operator is not case sensitive in Teradata session mode. Consider the following example Select F_name from employee where F_name like ‘%JO%’; The following query will pick values matching with ‘JO’ and ‘jo’ as well, since Teradata is not case-sensitive To overcome this problem, a new function called “CASESPECIFIC” is used in TERADATA as follows Select F_name from employee where F_name (CASESPECIFIC) like ‘%JO%’;
79.How do you submit bteq script (batch mode)?
Start the BTEQ , by typing BTEQ
Enter the following command
.run file = BTEQScript.btq OR
Bteq < BTEQScript.btq
BTEQ Script.btq contains following .logon 127.0.0.1/dbc, dbc; sel top 10 * from dbc.tables; .quit These are some tricky questions asked about explain to see if an candidate has really worked on query tuning and how much he would know about optimizer & explain .
80.What is residual condition?
Residual condition means a filter is applied on particular table to limit the number of rows fetched into Spool. Say for example Sel * from EMP.Employee where emp_sal > 10000; Here , residual condition will act upon to fetch only employees with salary greater than 10000
81.What is Spool “ Last use”
The particular spool file used in the step will be used for last time and the spool space will be released since it is no longer used in further steps of the query execution.
82.How do you create materialized view in Teradata?
84.Which two statements are true about a foreign key?
Each Foreign Key must exist as a Primary Key.
Foreign Keys can change values over time.
First: True Second: False
85.What are the benefits of fallback?
- Protects your data from hardware (disk) failure.
- Protects your data from software (node) failure.
- Automatically recovers with minimum recovery time, after repairs or fixes are complete
86.What is a Dirty-Read or Stale-Read Lock?
This occurs when a access lock is applied on the table which is doing a update. May produce erroneous results if performed during table maintenance resulting in Dirty Read or stale read , which might result in inconsistent result set.
It is a Teradata specified Join, which is used as equivalent to product join. There is no “On” clause in case of CROSS join SELECT EMP.ename , DPT.Dname FROM employee EMP CROSS JOIN Department DPT WHERE EMp.deptno = DPT.depto
88. Difference between Stored Procedure and Macro?
- It does not return rows to the user.
- It has to use cursors to fetch multiple rows
- It used Inout/Out to send values to user
- It Contains comprehensive SPL
- It is stored in DATABASE or USER PERM
- A stored procedure also provides output/Input capabilities
- It returns set of rows to the user.
- It is stored in DBC SPOOL space
- A macro that allows only input values
89. What are the scenarios in which Full Table Scans occurs?
- The where clause in SELECT statement does not use either primary index or secondary index
- SQL Statement which uses a partial value (like or not like), in the WHERE statement.
- SQL Statement which does not contain where clause.
- SQL statement using range in where clause. Ex. (col1 > 40 or col1 < =10000)
90.How to eliminate Product Joins in a Teradata SQL query?
- Ensure statistics are collected on join columns and this is especially important if the columns you are joining on are not unique.
- Make sure you are referencing the correct alias.
- Also, if you have an alias, you must always reference it instead of a fully qualified tablename.
- Sometimes product joins happen for a good reason. Joining a small table (100 rows) to a large table (1 million rows) a product join does make sense.
HIGH CONFIDENCE:Statistics are collected.
LOW CONFIDENCE: Statistics are not collected. But the where condition is having the condition on indexed column. Then estimations can be based on sampling.
NO CONFIDENCE: Statistics are not collected and the condition is on non indexed column.
92. What is Teradata Virtual storage?
- This concept is introduced in TD12. It does the following tasks
- Maintains Information On Frequency Of Data Access
- Tracks Data Storage Task On Physical Media
- Migrating Frequently Used Data To Fast Disks And Less Frequently Used Data To Slower Disks.
- Allocating Cyclinders From Storage To Individual Amps
93.How to start / stop a database in windows?
- logon to CMD
- check for state pdestate -d
- run the following command to start “net start recond”
- check for status pdestate -d
- to STOP the database Trusted Parallel Application or TPA
- tpareset -x comment
- The -x option stops Teradata without stopping the OS.
94 What is a role?
A role is a set of access rights which can be assigned to the users. They indirectly help in performance by reducing the number of rows entered in DBC.accessrights
95.What is a profile?
A profile contains set of user parameters like accounts, default database, spool space, and temporary space for a group of users To assign the profile to a user, use the AS PROFILE modified in a CREATE USER or MODIFY USER statement: MODIFY USER username AS PROFILE=profilename ; To remove a profile from a member but retain the profile itself: MODIFY USER username AS PROFILE=NULL ;
96. How to check if given object is a database or user ?
To check whether the given object is user or database , we can use following query sel * from dbc.databases where dbkind =’U’ or dbkind=’D’;
97. In a table can we use primary key in one column and in another column both unique and not null constrains.if yes how?
Yes, you can have a column for Primary key and have another column which will have no duplicates or null.e.g.A Salary Table will have employee ID as primary key. The table also contains TAX-ID which can not be null or duplicate
98. Is it possible that there are two primary key will be in a same table?
Primary key ==========
- 1. A table should have only one primary key
- 2. More than one column can consist of a primary key – upto 64 columns
- 3. Can not be NULL values (missing values)
- 4. Should be unique values (no duplicate value)
Foreign key ==========
- 1. A table may have zero or more than that up-to 32 keys
- 2. More than one column can consist a primary key – up to 64 columns
- 3. Can have NULL values
- 4. Can have duplicate values
- 5. There should be a corresponding primary key in the parent table to enforce referential integrity for a foreign key
99.Can you recover the password of a user in Teradata?
No, you can’t recover the password of a user in Teradata. Passwords are stored in this data dictionary table (DBC.DBASE) using a one-way encryption method. You can view the encrypted passwords using the following query SELECT * FROM DBC.DBASE;
100.What are the functions of a Teradata DBA?
- User Management – Creation and managing Users, Databases, Roles, Profiles and Accounts.
- Space Allocation – Assigning Permanent Space, Spool Space and Temporary Space.
- Access of Database Objects – Granting and Revoking Access Rights on different database objects.
- Security Control – Handling logon and logoff rules for Users.
- System Maintenance – Specification of system defaults, restart etc.
- System Performance – Use of Performance Monitor(PMON), Priority Scheduler and Job Scheduling.
- Resource Monitoring – Database Query Log(DBQL) and Access Logging.
- Data Archives, Restores and Recovery – ARC Utility and Permanent Journals.
101. How does Hashing happens in Teradata?
- Hashing is the mechanism through which data is distributed and retrieved to/from AMPs.
- Primary Index (PI) value of a row is the input to the Hashing Algorithm.
- Row Hash (32-bit number) value is the output from this Algorithm.
- Table Id + Row Hash is used to locate Cylinder and Data block.
- Same Primary Index value and data type will always produce same hash value.
- Rows with the same hash value will go to the same AMP.
- So data distribution depends directly on the Row Hash uniqueness; be careful while Choosing Indexes in Teradata.
102. How to insert a new line between two strings concatenated in answer sets of teradata.
It can be achieved by using the hexadecimal string ’0D’XC for this. eg: SELECT ‘ABC’ || ’0D’XC || ‘DEF’ ; When you execute this in queryman you will see the title in two lines. But the result will be in a line, just copy/paste the result in Dos or windows or notepad or textpad it will appear in two lines
103..Verify if two database are in sync and having same objects ?
104.How to recover left over spool space in Teradata ?
To Identify the left over spool, use the following SQL.
105. Explain LOB?
It is possible to create functions which operate on large object data types. The following library functions exists to provide the ability use LOB’s as function input or output:
106.What is the difference between EXCEPT and MINUS operator in Teradata SQL?
EXCEPT is ANSI Compliant and MINUS is Teradata specific. Function wise both are same.
107: How many codd’s rules are satisfied by teradata database?
There are 12 codd’s rules applied to the teradata database
108. What is meant by a Least Cost Plan?
A Least Cost Plan basically executes in less time across the shortest path.
109. What is meant by a Highest Cost Plan?
As per Highest Cost Plan, the time taken to execute the process is more, and it takes the longest path available.
110.Why Fload doesn’t support multiset table?
For Fast Performance it will not support the duplicate values,Refential Intigrity
111. What is FILLER command in Teradata?
while using the mload of fastload if you don;t want to load a particular filed in the datafile to the target then use this filler command to achieve this :
112.Why does varchar occupy 2 extra bytes?
The two bytes are for the number of bytes for the binary length of the field. It stores the exact no of characters stored in varchar
113. How to you verify a complicated SQL?
114. In Teradata, how do we Generate Sequence?
In Teradata, we Generate Sequence by making use of Identity Column
115. How do you see a DDL for an existing table?
By using show table command. .
116.What is the Maximum number of volatile tables that can be created?
We can override default read lock on any table by explicitly mentioning the ACCESS LOCK before using the table. Syntax is: LOCK TABLE EMPLOYEES FOR ACCESS SELECT * FROM EMPLOYEES;
118 How to check all the volatile tables created for the current session ?
HELP VOLATILE TABLE;
119. Where is password stored in Teradata?
User can view the password in DBC.DBASE view however it will be in encrypted format.
120. How to specify security mechanism option in BTEQ?
- We can use logmech command to specify security mechanism in BTEQ in Teradata
- eg: .LOGMECH LDAP;
121. What is datatype of keyword NULL ?
- NULL keyword is of INTEGER datatype.
122. ORDER BY clause is not allowed along with SELECT clause in which situation ?
123. Can you specify ORDER BY COLUMN_NAME in SET OPERATORS like UNION/MINUS etc ?
No. If you need to use ORDER BY , then specify COLUMN ORDER rather than COLUMN NAME.
124. While exporting data using BTEQ Export, you are getting junk characters in the exported file. How will you remove these junk characters ?
Add a BTEQ setting before your SELECT clause in BTEQ Export: .SET RECORDMODE OFF;
125. How can I change NULL and DATE values representation in TSA?
Go to TOOLS –> OPTIONS –> DATA FORMAT
126. When we get the error in TSA, we can see it in bottom status bar. However how to see last encountered error when we have executed some other successful query?
You can press F11 and it will display last error encountered.
127. When I am fetching column from table of BIGINT datatype I am seeing junk characters. How to see proper values? I am using old version of SAT.
Try using CAST to change the value to VARCHAR and you can see the value now. Alternatively, you can use BTEQ utility to preview correct format values.
128. What are the connectivity options available for TSA?
ODBC and Teradata.net
129. Name three result code variables in Stored Procedures?
130. What is the default value for any three result code variables?
SQLSTATE : ‘00000’,
131. What are the SQL SECURITY Options available in Stored Procedures in Teradata?
SQL SECURITY OWNER/CREATOR/DEFINER/INVOKER
131. What is the default SQL SECURITY OPTION in Stored Procedures in Teradata?
SQL SECURITY DEFINER
132. I want to process all the column values in some table using Stored Procedure. How to implement it in Teradata ?
We can use REF CURSOR to implement this in Stored Procedure in Teradata.
133.Creating Duplicate users for Teradata ?
Teradata Administrator (WinDDI) gives us an option to clone the users.
134. While retrieving COUNT(*) in your SQL Query, it is giving numeric overflow error. How will you get the count now ?
- Try to cast count(*) to some other bigger datatype in order to avoid numeric overflow error:
- SEL CAST(COUNT(*) AS DECIMAL(18,0)) FROM TABLE_NAME;
135. Will the count be same if we use COUNT(*), COUNT(20), COUNT(COLUMN_NAME) from some table ?
COUNT(*) and COUNT(20) will give same number of records however COUNT(COLUMN_NAME) will return number of records excluding NULL values. So if all the records are NOT NULL , then all three will result in same number of records.
136. While exporting data using BTEQ Export, you are getting junk characters in the exported file. How will you remove these junk characters ?
Add a BTEQ setting before your SELECT clause in BTEQ Export: .SET RECORDMODE OFF;
137.How to find the number of AMPs in any system?
SELECT HASHAMP()+1 ; [ We have to do +1 because the HASHAMP number starts with 0]
138 .How will you connect a database server to other server?
139. Syntax for CASE WHEN statement?
140.What are the types of HASH functions used in teradata?
- These are the types of HASH, HASHROW, HASHAMP and HASHBAKAMP.
- Their SQL functions are- HASHROW (column(s)) HASHBUCKET (hashrow) HASHAMP (hashbucket) HASHBAKAMP (hashbucket)
- To find the data distribution of a table based on PI, below query will be helpful. This query will give the number of records in each AMP for that particular table.
SELECT HASHAMP(HASHBUCKET(HASHROW(PI_COLUMN))),COUNT(*) FROM TABLENBAME GROUP BY 1.
141. What’s the difference between TIMESTAMP (0) and TIMESTAMP (6)?
TIMESTAMP (0) is CHAR (19) and TIMESTAMP (6) is CHAR (26) Everything is same except that TIMESTAMP (6) has microseconds too.
142. How can we check the version of Teradata that we are using currently?
Just give the command .SHOW VERSION.
143. Mention the procedure via which, we can run Teradata jobs in a UNIX environment.
$Sh > BTEQ < [Script Path] > [Logfile Path] or
$Sh > BTEQ < [Script Path] TEE [Logfile Path]
144. In BTEQ, how are the session-mode parameters being set?
.set session transaction BTET -> Teradata transaction mode
.set session transaction ANSI -> ANSI mode
These commands will work only when they are entered before logging into the session.
145. While retrieving COUNT(*) in your SQL Query, it is giving numeric overflow error. How will you get the count now ?
- Try to cast count(*) to some other bigger datatype in order to avoid numeric overflow error:
- SEL CAST(COUNT(*) AS DECIMAL(18,0)) FROM TABLE_NAME;
146. Will the count be same if we use COUNT(*), COUNT(20), COUNT(COLUMN_NAME) from some table ?
COUNT(*) and COUNT(20) will give same number of records however COUNT(COLUMN_NAME) will return number of records excluding NULL values. So if all the records are NOT NULL , then all three will result in same number of records.
147. Date column is not shown properly in TSA. How to overcome this issue?
CAST the date column into varchar, now the date column will be displayed in proper format.
148.How to view recent query executed in teradata ?
To view recent queries in Teradata there are two simple ways : First one is you can enable history if you are using some utility tool for running SQL’s like
- Teradata SQL Assistant. To enable history in SQL Assistant go to view –> Show History. It will list down all the SQL executed on the system . The disadvantage is it will display only those queries which were executed from that system and not all the queries which were executed on Teradata server.
- Second way of viewing recent queries in Teradata is to use DBC.QRYLOG
- SEL * FROM DBC.QRYLOG;
- This will display all the queries executed on the Teradata Server. Along with the queries it will also show which user executed the query and at what time . You can get much more information from this query. As the query will return huge number of rows if your Teradata Server is mature , you can add some filters in order to minimize the output rows.
HELPFUL DBA SQL
180.What are permanent journals in teradata?
- Journals are used to capture information about table in Teradata. In case of Permanent journals they capture details of Journal enabled tables in teradata with all the pre transaction and post transaction details .
- Journal tables are assigned PERM space and they reside in same database as of parent or they can reside on different database.
- They are mainly used for protection of data and sometimes also for disaster recovery ( fallback is better in this case )
- Permanent journal tables can be enabled or disabled by running alter database <databasename> ‘no journal’ /’ journal = <databasename.jrnltbl>’
- Arcmain utility provides the feature of backing up Journal tables
- We can find details about all journal tables present in teradata database using DBC.JOURNALS table.
181.what are different types of journals in teradata?
There are 3 different types of journals available in Teradata.
- 1. Transient Journal – This maintains current transaction history. Once the query is successful it deletes entries from its table . If the current query transaction fails, It rolls back data from its table.
- 2. Permanent Journal – This is defined when a table is created. It can store BEFORE or AFTER image of tables. DUAL copies can also be stored. Permanent Journal maintains Complete history of table.
- 3.Down AMP recovery Journal (DARJ) – This journal activates when the AMP which was supposed to process goes down. This journal will store all entries for AMP which went down. Once that AMP is up, DARJ copies all entries to that AMP and makes that AMP is sync with current environment.
182.What are benefits of Permanent Journal?
- Permits capture of before images for database rollback.
- Permits capture of after images for database roll forward.
- Permits archiving change images during table maintenance.
- Reduces need for full-table backups.
- Provides a means of recovering NO FALLBACK tables.
- Requires additional disk space for change images.
- Requires user intervention for archive and recovery activity
183.What is the purpose of Before Journal and After journal?
We use both before journal and after journal in Permanent journal to do selective or full recovery of data
184.Where does TD store transient journal?
In perm space -> dbc.transientjournal
But that special table can grow over dbc’s perm limit until the whole system runs out of perm space.
185.What is collect State in Teradata ? What it use and how it works?
- This statistics is used by the PE to optimize the plan which can be viewed by explain command.
- When collect stats is executed,the statistics is stored for use of PE.
- The Optimizer plans an execution strategy for every SQL query submitted to it.
- For the Optimizer to consistently choose the optimum strategy, it must be provided with reliable, complete, and current demographic information regarding all of these factors. The best way to assure that the Optimizer has all the information it needs to generate optimum execution strategies is to COLLECT STATISTICS.
- Statistics tell the Optimizer how many rows/ value there are.
- May improve performance of complex queries and joins.
- NUSI Bit Mapping requires collected statistics.
- Helpful in accessing a column or index with uneven value distribution.
- Stale statistics may mislead the Optimizer into poor decisions.
- Statistics remain valid across a reconfiguration of the system.
- COLLECT is resource intensive and should be done during off hours.
186.When should the statistics be collected?
Here are some excellent guidelines on when to collect statistics:
- All non unique indices
- Non index join columns
- Primary index of small tables
- Secondary indexes mentioned in the tables
- Primary Index of a Join Index
- Secondary Indices defined on any join index
- Join index columns that frequently appear on any additional join index columns that frequently appear in WHERE search conditions
- Columns that frequently appear in WHERE search conditions or in the WHERE clause of joins.
187.How to make sure Optimizer chooses NUSI over Full Table Scan (FTS)?
- A optimizer would prefer FTS over NUSI, when there are no Statistics defined on NUSI columns.
- It would prefer scanning for full table instead of going for Subtables in NUSI since optimizer does not have information about subtables of NUSI
- It is always suggested to collect statistics whenever NUSI columns are defined on the table. Verify whether index is being used by checking in Explain plan.
188.what is collected in Collect Statistics
The Teradata Parsing Engine (PE) is in charge of creating the PLAN for the AMPs to follow. The PE works best when Statistics have been collected on a table. Then it knows:
- The number of rows in the table
- The average row size Information on all Indexes in which statistics were collected
- The range of values for the column(s) in which statistics were collected
- The number of rows per value for the column(s) in which statistics were collected
- The number of NULLs for the column(s) in which statistics were collected
189.Purpose of collect statistics
- The optimizer uses this synopsis data to generate efficient table access and join plans
- When statistics are collected, Teradata does a full table scan, sorts the column or index
- You don’t COLLECT STATISTICS on all columns and indexes because it takes up too much space for unnecessary reasons,
- If you run an Explain on a query and the row estimate has No Confidence or Low Confidence, then that is a sign that no statistics were collected.
- The first time you collect statistics, you collect them at the index or column level. After that you just collect statistics at the table level and all previous columns collected previously are collected again. It is a mistake to collect statistics only once and then never do it again. COLLECT STATISTICS each time a table’s data changes by 10%
190.What does collect statistics do to help the PE come up with a better plan?•
Access Path – The PE will easily choose and use any Primary Index access (UPI or NUPI) and it will also easily choose a Unique Secondary Index (USI), but statistics really help the PE decide whether or not to do a Full Table Scan or use a Non-Unique Secondary Index (NUSI) or if it can use multiple NUSI’ s AND ed together to perform a NUSI bitmap.•
Join Method – When you collect statistics it gives Teradata a better idea whether or not to do a merge join, product join, hash join or nested join.•
Join Geography – When two rows are joined together they must physically be located on the same AMP. The only way that this happens naturally is if the join column (PK/FK) is the Primary Index of both tables. Most of the time this is not the case and Teradata must decide the Join Geography of how it will relocate the rows to co-locate them on the same AMP. Will it redistribute (rehash by the join column) one or both of the tables, or will it duplicate the smaller table across all AMPs? A redistribution or duplication are the paths to co-location.•
Join Order – All joins are performed two tables at a time. What will be the best order to join the tables together? When two or more tables are involved this becomes very important..
It is the access path, the join method, the join geography and the order that makes statistics collection so vital to all Teradata systems.
191.How do you know if Statistics were collected on a Table?
Syntax: HELP Statistics <Table Name>
192.Where Does Teradata Keep the Collected Statistics?
spread evenly across all AMPs in three tables:
- DBC.Indexes(for multi-column indexes only)
- DBC.TVFields(for all columns and single column indexes)
- DBC.StatsTbl(Teradata V14 and beyond)
193.How to Recollect STATISTICS on a Table
Here is the syntax for re-collecting statistics on a table:
COLLECT STATISTICS ON <Tablename> ;
The first time you collect statistics, you do it for each individual column or index that you want to collect on. When a table changes its data by 10% due to Inserts, Updates, or Deletes you merely use the command above, and it re-collects on the same columns and indexes previously collected on.
194.what is Random AMP Sample
Teradata Always Does a Random AMP Sample
The Parsing Engine will hash the Table ID for a table being queried and then use the Hash Map to determine which AMP will be assigned to do a Random AMP Sample for this table.
Remember that a Random AMP sample only applies to indexed columns and table row counts.
In the old days, Teradata never did a Random AMP Sample unless statistics were not collected, but these days Teradata always does a Random AMP Sample before placing the Table Header inside each AMP’s FSG Cache.
This allows Teradata to compare these statistics with collected statistics to determine if statistics are old and stale.
If the statistics are determined to be out of date then the Random AMP Sample is used.Random Sample is Kept in the Table Header in FSG Cache
Teradata compares the collected statistics to a Random AMP Sample (obtained by sampling a single AMP before placing the Table Header in FSG Cache). This compare determines if the statistics will be used or if they should be replaced by the sample.
195.Multiple Random AMP Samplings
The PE does Random AMP Sampling based on the Table ID.
The Table ID is hashed, and that AMP is always selected as the sampled AMP for that table. This assures that no single AMP will be tasked for too many tables, but if the table is badly skewed this can confuse the PE.
So now, more than one AMP can be sampled when generating row counts for a query plan for much better estimations on row count, row size, and rows per value estimates per table.
In the DBS Control area, field 65 can now set the standard for how AMPs are sampled.
65. RandomAmpSampling – this field determines the number of AMPs to be sampled for getting the row estimates of a table.
The valid values are D, L, M, N or A.
D – The default is one AMP sampling (D is the default unless changed.)
L – Maximum of two AMPs sampling
M – Maximum of five AMPs sampling
N – Node Level sampling (all the AMPs in a node are sampled).
A – System Level sampling (all the AMPs in the system are sampled).
Multiple AMPs can now be used for the random AMP sample so a higher number of AMPs sampled will provide better estimates to counter skewed results. But, it can cause short running queries to run slower just so long running queries can run faster.
Random AMP Estimates for NUSI Secondary Indexes
The Random AMP reads a couple of cylinders and some data NUSI blocks and then does some simple math to estimate the Rows Per NUSI Value. The PE then knows how strong or weak the WHERE clause is using the NUSI and if it should even use the NUSI. This is the most important decision for the Parsing Engine. Should it just do a Full Table Scan or use the NUSI? That is the biggest reason the PE needs statistics. That is why you should always collect statistics on all NUSI indexes.
You really only need to collect statistics on a Unique Secondary Index column if there are a lot of SQL statements on non-equality conditions such as range queries.
There’s No Random AMP Estimate for Non-Indexed Columns
Teradata does not do a Random AMP Sample for non-indexed columns that are used in the WHERE clause of the SQL
Today’s Teradata systems always perform a random AMP sample even if tables have statistics. Then, they compare the random AMP sample with the statistics to determine if the statistics are stale.
A random AMP is selected for a random sample. Two things happen:
1) Indexes are sampled on the random AMP, and the PE estimates based on the total number of AMPs in the system.
2) If a column in the WHERE clause of the SQL is not an Index, the PE assumes that 10% of the rows will come back. If two columns are in the WHERE clause, then it assumes 7.5% of the rows will come back. If three columns are in the WHERE Clause, it assumes 5%.
196.When to COLLECT STATISTICS Using only a SAMPLE
You might consider Collecting Statistics with SAMPLE if:
|You are collecting statistics on a very large table.|
|When collecting statistics becomes a problem with system performance or cost because the system is so busy.|
Don’t consider Collecting Statistics with SAMPLE if:
|The tables are small.|
|To replace all existing full scan Collect Statistics.|
|If the column’s data is skewed badly.|
COLLECT STATISTICS can be very time consuming because it performs a full table scan and then performs a lot of statistical calculations. Because Collect Statistics runs infrequently and benefits query optimization, it is considered a necessary task. Without statistics, query performance will suffer. The bad news about sampled statistics is that they may not be as accurate, which could negatively affect the PE’s plans. In most cases, sampled statistics are better than no statistics. Don’t use Sample unless necessary!
Examples of COLLECT STATISTICS Using only a SAMPLE
Sampled statistics are generally more accurate for data that is not skewed. For example, columns or indexes that are unique or nearly unique are not skewed. Because the PE needs to be aware of skewed data, you should not collect with sample on skewed data. That is why sampling is generally more appropriate for indexes than non-indexed column(s). If you recollect statistics on a Sample, it recollects with the same Sample!
197.How to Collect Statistics on a PPI Table on the Partition
Here is the syntax for collecting statistics on a PPI table.
COLLECT STATISTICS on <Tablename> COLUMN PARTITION;
Here is an actual example COLLECT STATISTICS on Order_Table_PPI COLUMN PARTITION;
Three reasons to Collect on the Partition:
The Parsing Engine will have a better plan for PPI Tables.
This helps the most with Partition Elimination on Range Queries.
This is especially helpful when a table has a lot of empty partitions.
The Parsing Engine can use this information to better estimate the query cost when there are a significant number of empty partitions. If PARTITION statistics are not collected, empty partitions may cause the Parsing Engine to underestimate the number of rows in a partition. You shouldn’t use WITH SAMPLE to collect on Partitions.
198.Teradata V14 Statistics Enhancements•
Teradata V12 and V13 Statistics EnhancementsI
- n V12, Extrapolate Statistics is designed to more accurately provide for a statistical estimate for date range-based queries that specify a “future” date that is outside the bounds of the current statistics. This results in less re-collections
- .In V12, Stale Statistics Detection compares the Random AMP Sample with the statistics collected and determines if they are stale, and should not be used.
- In V13, Statistics can now be collected on Volatile Tables.
- In V13, PARTITION statistic capabilities have been added to Global Temporary Tables.
- In V13, Multi-Column statistics are now available on Hash and Join Indexes.
- In V13, Sample Statistics are available on Tables, Volatile Tables, Global Temporary Tables, Hash Indexes and Join Indexes, including the Partition Columns.
Teradata V14 Statistics Enhancements
- There is now a SUMMARY option to collect table-level statistics.•
- SYSTEM SAMPLE option allows the system to determine the sampled system percentage.• Sampling options have been enhanced (e.g., SAMPLE n PERCENT).•
- Statistics are stored in DBC.StatsTbl to reduce access contention and improve performance.•
- New views DBC.StatsV, DBC.ColumnStatsV, DBC.MultiColumnStatsV, and IndexStatsV.•
- SHOW STATISTICS statement reports detailed statistics in plain text or XML formatting.•
- Internal PE enhancements for histogram structure and use,
- including:•Storing statistics data in their native Teradata data types without losing precision• Enhanced extrapolation methods for stale statistics•
- Maintaining statistics history
- Teradata V14 now allows you to determine a sampling percentage for sampled statistics. You can even collect/recollect either summary statistics or both full and summary statistics combined. You can now collect statistics on global temporary tables, and you can provide a name for statistics collected while also being able to specify the column ordering for multicolumn statistics. There is also a dedicated statistics cache that is designed to improve query optimization time.
To collect sample statistics using the system default sample:
COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN (Product_ID) ON Sales_Table;
To collect sample statistics by scanning 15 percent of the rows and use 100 intervals:
COLLECT STATISTICS USING SAMPLE 15 PERCENT AND MAXINTERVALS 100 COLUMN(Product_ID) AS Product_Stats ON Sales_Table;
Teradata V14 Summary Statistics
New in Teradata 14.0, table-level statistics known as “summary statistics” are collected whenever column or index statistics are collected. Summary statistics do not cause their own histogram to be built, but rather they create important facts about the table undergoing collection that are held in the new DBC.StatsTbl. Here are some of the items in “summary statistics”:
- Row count
- Average block size
- block level compression metrics
SHOW SUMMARY STATISTICS VALUES ON Employee_Table;
Teradata V14 MaxValueLength
USING MAXVALUELENGTH 50
COLUMN ( Product_Name)
ON Product_Table ;
Before V14, whenever you collected statistics, Teradata only placed the first 16 bytes in the statistics. Long names were cut off. Now, the default is 25 bytes, but you can use theMaxValueLength keyword (example above) to specify the length you want.
MAXVALUELENGTH lets you expand the length of the values contained in the histogram for that statistic. The new default length is 25 bytes when previously it was 16. If needed, you can specify well over 1000 bytes for a maximum value length. The 16-byte limit on value sizes in earlier releases was always padded to 16 bytes, but now the length can be longer, however no padding is done.
Teradata V14 MaxIntervals
USING MaxIntervals 500
COLUMN ( Last_Name)
ON Employee_Table ;
Before V14, whenever you collected statistics Teradata did a full table scan on the values, sorted them, and then placed them into 200 intervals. Now, the default is 250 intervals, but you can specify (example above) the number of intervals you desire.
Each statistics interval highlights its single most popular value, and the number of rows that carry that value are recorded. The rest of the values in the interval are estimated. By increasing the number of intervals, the optimizer can accurately get a better row count for a greater number of the most popular values. A larger number of intervals can be useful if you have widespread skew on a column or index you are collecting statistics on and you want more individual high-row-count values to be represented in the histogram. The range is 0 – 500 for MaxIntervals.
Teradata V14 Sample N Percent
USING Sample 20 Percent
COLUMN ( Last_Name)
ON Employee_Table ;
Using Sample before defaulted each time to the system parameter, but now you can specifically state the percent you want for each column or index.
SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection level, rather than at the system level. Now, different levels of statistics sampling to different columns and indexes can be performed. The better you get at knowing your data and the queries upon them, the more you can specifically use the sampling to better help the PE.
Teradata V14.10 Statistics Collection Improvements
Expressions can now be specified in COLLECT STATS statements.
Expressions include: Substr(), MOD, CONCATENATION, Format Conversion expressions like Uppercase, Lowercase, Data type Conversions, CASE-Expressions, BEGIN and END expressions of PERIOD types.
COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN
(EXTRACT(MONTH FROM Order_Date), Order_Total) AS Month_Plus_TotalON Order_Table ;
It provides automatic downgrading of Full Stats to Sample Stats if column is eligible and system determines if full statistics would be better off if avoided.
Teradata will detect the spool space required before attempting the statistics collection. If it is determined that there will not be enough space, then Teradata will:
- Build a new layer of local statistics on one of the AMPS, and then the global statistics will be based on local statistics.
It provides statistics re-collection performance improvements to reduce statistics re-collection overhead by automatically skipping the recollection process when not needed.
Above are the features of Teradata V14.10 Statistics Collection Improvements.
Teradata V14.10 Statistics Collection Threshold Examples
TIME and CHANGE based THRESHOLD option can be used to reduce statistics re-collection overhead by automatically skipping the recollection process when not necessary.
For example, if the user sets a TIME THRESHOLD for specific statistics collection to be 5 days, and issues a collect stats after 3 days, the system will not collect statistics.
For example, if a user sets a CHANGED based THRESHOLD of 10%, and the data has not changed by 10%, the system will not collect statistics. Insert and Updates are taken into consideration to determine data growth.
COLLECT STATISTICS USING THRESHOLD 10 PERCENT AND THRESHOLD 5 DAYS
COLUMN (Customer_Number) ON Customer_Table ;
If a COLLECT STATISTICS statement is issued less than 5 days after the above statement, or if the data has not changed by 10%, the collection will not take place.
Here is SQL that will override the above statement:
COLLECT STATISTICS USING NO THRESHOLD
FOR CURRENT COLUMN (Customer_Number) ON Customer_Table ;
Teradata V14.10 AutoStats feature
Teradata V14.10 Autostats identifies and collects missing statistics needed and detects stale statistics for refreshing. If statistics are not being used by optimizer, they will be removed.
A new repository of a system supplied database named TDSTATS stores metadata for all stats collections. This is created by a new process called DIPSTATS.
External stored procedures (XSPs) perform the stats management process.
There have been DBQL Enhancements to the Log Optimizer statistics recommendations and usage with dedicated DBQL logging option. DBQL will analyze query plans that have been logged to recommend new stats or identify unused stats.
Teradata recommends that customers should NOT replace of all their existing stats management procedures, but instead begin experimenting with AutoStats and build upon the experience.
Above are the features of Teradata V14.10 AutoStats.
When Teradata Collects Statistics it creates a Histogram
1. Highest Sorted Value in the Interval
2. Most Frequent Value in the Interval
3. Rows with the Most Frequent Value
4. Other Values in the Interval
5. Number of Rows of other Values