TERADATA COMPRESSION

The COMPRESS clause works in two different ways:

When issued by itself (without a value or values), COMPRESS causes all NULL values for that column to be compressed to zero space. The presence bit for NULLs is used, so there is no additional cost.

When issued with one or more values (e.g., COMPRESS ('Ohio', 'Iowa', 'Georgia')), the COMPRESS clause compresses every occurrence of Ohio, Iowa, and Georgia in that column to zero space and also compresses every NULL value. You can compress up to 255 values per column, at an additional cost of between 1 and 8 bits per row.
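
As an illustration, both forms can appear in a table's DDL. A minimal sketch, assuming a hypothetical CUSTOMER_ADDRESS table (all names are placeholders):

CREATE MULTISET TABLE P_1111_A.CUSTOMER_ADDRESS
(
CUSTOMER_ID INTEGER NOT NULL,                              -- primary index column: must not be compressed
MIDDLE_NAME CHAR(30) COMPRESS,                             -- NULL-only compression, no value list
STATE_NAME  CHAR(20) COMPRESS ('Ohio', 'Iowa', 'Georgia')  -- multi-value compression, NULLs included
)
PRIMARY INDEX ( CUSTOMER_ID );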

You CANNOT compress the following:

  • Components of the primary index
  • Identity columns
  • Columns of volatile tables
  • Columns of derived tables
  • Columns defined with a UDT, Period, Geospatial, BLOB, or CLOB data type
  • Both the referencing and the referenced columns under standard RI (Referential Integrity); Batch and Soft RI columns can be compressed
  • Index and partitioning columns
  • Depending on the Teradata release you are working with, data types such as VARCHAR or TIMESTAMP
  • Compression values beyond the point where the table header section’s space is used up

A budget approach for getting it done

To execute the following procedure for column selection, basic documentation, and recommendation, you do not need a detailed technical understanding of what exactly happens where. Use the approach presented below as a “budget” one when you need to deliver results soon.

Start with all columns of all tables of your database and apply the following filters:

Technical filter 1:

Exclude all columns that are not eligible for compression because of their technical role (primary index, partition). This is a hard filter at this point. Only a change of the physical data model could alter your decision here.
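
A minimal sketch of this filter against the DBC data dictionary, assuming the database P_1111_A used in the examples below. It relies on dbc.IndicesV listing primary index columns (IndexType 'P' for a nonpartitioned, 'Q' for a partitioned primary index); partitioning expression columns are not covered here and would have to be checked separately, for example via the partitioning constraint text:

-- TECHNICAL FILTER 1 (sketch): drop primary index columns
SELECT c.TableName, c.ColumnName
FROM dbc.ColumnsV c
LEFT JOIN dbc.IndicesV i
ON  i.DatabaseName = c.DatabaseName
AND i.TableName    = c.TableName
AND i.ColumnName   = c.ColumnName
AND i.IndexType IN ('P','Q')
WHERE c.DatabaseName = 'P_1111_A'
AND i.ColumnName IS NULL          -- keep only columns that are not part of the primary index
ORDER BY 1,2;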

Technical filter 2:

Exclude all columns that do not have the right data type for compression. This filter is less restrictive as data types can be changed if the original design was more of an educated guess than a prudent decision.
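
A corresponding sketch for the data type filter. The ColumnType codes in dbc.ColumnsV are abbreviations (CV = VARCHAR, TS = TIMESTAMP, BO = BLOB, CO = CLOB); which codes you exclude depends on your release, as noted above:

-- TECHNICAL FILTER 2 (sketch): drop non-compressible data types
SELECT TableName, ColumnName, ColumnType
FROM dbc.ColumnsV
WHERE DatabaseName = 'P_1111_A'
AND ColumnType NOT IN ('CV','TS','BO','CO')   -- VARCHAR, TIMESTAMP, BLOB, CLOB
ORDER BY 1,2;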

Next, we can apply a global minimum-benefit filter on the tables by eliminating all those below a certain current size, as well as those that, given their size, are not expected to grow any further. If you do not want to exclude tables right away, or have no “small” ones in your database, you can rank them by non-technical urgency criteria such as usage in reporting.
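
Current table sizes can be read from dbc.TableSizeV; a minimal sketch, with an arbitrary 1 GB threshold as an assumption:

-- MINIMUM BENEFIT FILTER (sketch): keep only tables above a size threshold
SELECT DatabaseName, TableName,
CAST(SUM(CurrentPerm) / (1024*1024*1024.000) AS DECIMAL(18,3)) AS CURRENT_GB
FROM dbc.TableSizeV
WHERE DatabaseName = 'P_1111_A'
GROUP BY 1,2
HAVING SUM(CurrentPerm) > 1073741824   -- roughly 1 GB; adjust to your environment
ORDER BY 3 DESC;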

For all the table columns left, we can now look at the degree of concentration of values in the column. A column with a very high concentration on only a few values is a more promising candidate for compression than a column whose millions of different values are spread evenly. We can apply a common measure from the field of market concentration for our purpose: the Herfindahl-Index (HI). The closer a column’s HI value is to 1, the more potential there is for compression.
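
The HI of a column is simply the sum of its squared value shares. For one column this can be computed directly; a minimal sketch for a hypothetical column STATE_CD in table P_CUSTOMER (both names are placeholders), mirroring the generated statements further below:

-- HERFINDAHL INDEX FOR A SINGLE COLUMN (sketch)
SELECT SUM(CAST((ANZAHL/GES)**2 AS DECIMAL(38,18))) AS HI_INDEX
FROM ( SELECT STATE_CD, CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL
       FROM P_1111_A.P_CUSTOMER GROUP BY 1 ) x
JOIN ( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES
       FROM P_1111_A.P_CUSTOMER ) GES
ON 1=1;

A column in which, say, 90 percent of all rows carry a single value yields an HI of at least 0.81, whereas a column whose values are spread evenly over n distinct values yields an HI of only 1/n.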

In order to store the numeric basis of our decision, we keep a record of the HI value in a register table like this:

-- COLUMN DEMOGRAPHICS REGISTER
CREATE MULTISET TABLE P_1111_A.DMnn_COL_HI_INDEX ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
STATUS_DT DATE FORMAT 'YYYY-MM-DD',
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
HI_INDEX DECIMAL(38,18))
PRIMARY INDEX ( TableName, ColumnName );

We insert the HI values by first preparing the INSERT statements for all relevant tables and columns, which we then execute:

-- HHI INDEX CALCULATION PREPARATION
sel TABLENAME, columnname,
'INSERT INTO P_1111_A.DMnn_COL_HI_INDEX SELECT CURRENT_DATE, ''' || trim(tablename) || ''',''' || trim(columnName) || ''', SUM(CAST((ANZAHL/GES)**2 AS DECIMAL(38,18))) AS HI_INDEX
FROM
( SELECT ' || trim(columnName) || ', CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL FROM ' || trim(databasename) || '.' || trim(tablename) || ' GROUP BY 1 ) x
JOIN
(SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES FROM ' || trim(databasename) || '.' || trim(tablename) || ' ) GES
ON 1=1;' AS HHI
from dbc.columns
where databasename='P_1111_A'
and columnname NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
and substr(TABLENAME,1,1) IN ('P','T')
and ColumnType NOT IN ('CV','TS','BO')
order by 1,2
;

Note that the query contains a handful of constraints on the dbc.columns list to remove purely technical fields, exclude non-production tables, and skip columns with non-compressible data types.
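
For a single hypothetical column (STATE_CD in P_CUSTOMER again; both names are placeholders), the generator emits a statement along these lines, which is then executed as-is:

-- EXAMPLE OF A GENERATED HI INSERT (hypothetical table and column names)
INSERT INTO P_1111_A.DMnn_COL_HI_INDEX
SELECT CURRENT_DATE, 'P_CUSTOMER', 'STATE_CD',
       SUM(CAST((ANZAHL/GES)**2 AS DECIMAL(38,18))) AS HI_INDEX
FROM
( SELECT STATE_CD, CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL
  FROM P_1111_A.P_CUSTOMER GROUP BY 1 ) x
JOIN
( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES
  FROM P_1111_A.P_CUSTOMER ) GES
ON 1=1;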

Per table, we then select those columns whose HI is high enough and prepare the insert statements for their compression value candidates. For documentation purposes, we first create a second register table:

-- COLUMN COMPRESSION CANDIDATES
CREATE MULTISET TABLE P_1111_A.DMnn_COL_COMP_CAND ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
STATUS_DT DATE FORMAT 'YYYY-MM-DD',
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
COL_VALUE CHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
VALUE_SHARE DECIMAL(18,2) )
PRIMARY INDEX ( TableName, ColumnName, COL_VALUE );

-- FIND ALL COMPRESSION VALUE CANDIDATES
sel TABLENAME, columnname,
'INSERT INTO P_1111_A.DMnn_COL_COMP_CAND SELECT CURRENT_DATE, ''' || trim(tablename) || ''',''' || trim(columnName) || ''',' || trim(columnName) || ', CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) AS VALUE_SHARE
FROM
( SELECT ' || trim(columnName) || ', CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL FROM ' || trim(databasename) || '.' || trim(tablename) || ' GROUP BY 1 ) x
JOIN
(SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES, CAST(COUNT(DISTINCT ' || trim(columnName) || ') AS DECIMAL(38,18)) AS DDVAL FROM ' || trim(databasename) || '.' || trim(tablename) || ' ) GES
ON 1=1
WHERE CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) > CAST((1/DDVAL)*100 AS DECIMAL(38,2))
;' AS COMPRESSION_VALUES
from dbc.columns
where databasename='P_1111_A'
and columnname NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')
and substr(TABLENAME,1,1) IN ('P','T')
and ColumnType NOT IN ('CV','TS','BO')
order by 1,2
;
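
For the same hypothetical column, the emitted candidate statement looks roughly like this; only values whose share exceeds the share under a uniform distribution (1/DDVAL) are kept:

-- EXAMPLE OF A GENERATED CANDIDATE INSERT (hypothetical table and column names)
INSERT INTO P_1111_A.DMnn_COL_COMP_CAND
SELECT CURRENT_DATE, 'P_CUSTOMER', 'STATE_CD', STATE_CD,
       CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) AS VALUE_SHARE
FROM
( SELECT STATE_CD, CAST(COUNT(*) AS DECIMAL(38,18)) ANZAHL
  FROM P_1111_A.P_CUSTOMER GROUP BY 1 ) x
JOIN
( SELECT CAST(COUNT(*) AS DECIMAL(38,18)) AS GES,
         CAST(COUNT(DISTINCT STATE_CD) AS DECIMAL(38,18)) AS DDVAL
  FROM P_1111_A.P_CUSTOMER ) GES
ON 1=1
WHERE CAST((ANZAHL/GES)*100 AS DECIMAL(38,2)) > CAST((1/DDVAL)*100 AS DECIMAL(38,2));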

You can use both register tables to put together a compression recommendation.

The register tables allow new snapshots of the demographics to be added regularly, keyed by STATUS_DT, so that demographic shifts can be documented over time.
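
A minimal sketch of how the two registers can be combined into a draft recommendation; the HI threshold of 0.5 and the cut-off at 255 values per column are assumptions to adjust to your environment:

-- DRAFT COMPRESSION RECOMMENDATION FROM BOTH REGISTERS (sketch)
SELECT c.TableName, c.ColumnName, c.COL_VALUE, c.VALUE_SHARE, h.HI_INDEX
FROM P_1111_A.DMnn_COL_COMP_CAND c
JOIN P_1111_A.DMnn_COL_HI_INDEX h
ON  h.TableName  = c.TableName
AND h.ColumnName = c.ColumnName
AND h.STATUS_DT  = c.STATUS_DT
WHERE c.STATUS_DT = CURRENT_DATE
AND h.HI_INDEX >= 0.5                         -- concentration threshold (assumption)
QUALIFY ROW_NUMBER() OVER (PARTITION BY c.TableName, c.ColumnName
                           ORDER BY c.VALUE_SHARE DESC) <= 255   -- at most 255 values per column
ORDER BY 1,2,4 DESC;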

Before you prepare your final list for deployment, put your compression project into a long-term perspective:

How often will you be able to update your compression list?

How often do you expect it to be necessary?

How volatile is the column value set over time?

Yes-No flags will contain Yes or No as long as they exist, whereas daily account balances or order numbers are more transitory in nature.

These considerations cannot be delegated to Teradata. Rather, they should emerge from interaction with business representatives and management, or from your own curiosity if the latter are out of reach.
