
Monday, 21 October 2013

Adding Residual Conditions to a Join

Most of the examples in this book have included all rows from the tables being joined. However, in the world of Teradata, with millions of rows stored in a single table, additional comparisons are usually needed to reduce the number of rows returned. There are two ways to code residual conditions: as a compound condition in the ON clause, or in a WHERE clause added to the join. These residual conditions are in addition to the join equality in the ON clause.
Consideration should be given to the type of join when including the WHERE clause. The following paragraphs discuss the operational aspects of mixing an ON with a WHERE for INNER and OUTER JOIN operations.

INNER JOIN

The WHERE clause works exactly the same when used with the INNER JOIN as it does on all other forms of the SELECT. It eliminates rows at read time based on the condition being checked and any index columns involved in the comparison.
Normally, the fewer rows that are read, the faster the SQL will run. It is more efficient because fewer resources such as disk, I/O, cache space, spool space, and CPU are needed. Therefore, whenever possible, it is best to eliminate unneeded rows using a WHERE condition with an INNER JOIN. I like the use of WHERE because all residual conditions are located in one place.
The following samples are the same join that was performed earlier in this chapter. Here, one uses a WHERE clause and the other a compound comparison via the ON:
SELECT   Customer_name
         ,Order_number
         ,Order_total  (FORMAT '$$$,$$9.99')
FROM  Customer_table AS cust   INNER JOIN  Order_table AS ord
      ON   cust.customer_number = ord.customer_number
WHERE Customer_name LIKE 'Billy%'
ORDER BY 3 DESC;
Or
SELECT   Customer_name
         ,Order_number
         ,Order_total  (FORMAT '$$$,$$9.99')
FROM  Customer_table AS cust   INNER JOIN  Order_table AS ord
      ON   cust.customer_number = ord.customer_number
        AND  Customer_name LIKE 'Billy%'
ORDER BY 3 DESC;
2 Rows Returned

Customer_name         Order_number    Order_total
Billy's Best Choice   123456          $12,347.53
Billy's Best Choice   123512           $8,005.91
The output is exactly the same with both coding methods, which can be verified using EXPLAIN. We recommend using the WHERE clause with an inner join because it consolidates all residual conditions in a single location that is easy to find when changes are needed: a query may contain multiple ON comparisons, but there is only one WHERE clause.

Saturday, 19 October 2013

How to identify PPI columns?

SELECT databasename, tablename, columnposition, columnname
FROM dbc.indices
WHERE indextype = 'Q'
ORDER BY 1, 2, 3;

How to skip the header row in a FastLoad script

RECORD 2;   /* this skips first record in the source file */
DEFINE ...
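
For context, here is a minimal FastLoad script sketch showing where RECORD 2; usually sits; the logon string, database, table, file path, and column names are placeholders, not from the original post:

LOGON tdpid/your_user,your_password;
BEGIN LOADING mydb.sales_stage
      ERRORFILES mydb.sales_err1, mydb.sales_err2;
SET RECORD VARTEXT ",";            /* comma-delimited source file    */
RECORD 2;                          /* skip the first (header) record */
DEFINE sale_id  (VARCHAR(10)),
       sale_amt (VARCHAR(20))
FILE = /path/to/sales.csv;
INSERT INTO mydb.sales_stage (sale_id, sale_amt)
VALUES (:sale_id, :sale_amt);
END LOADING;
LOGOFF;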

How many types of Skew exist?

Skew exists when Teradata resources (CPU, AMPs, I/O, disk, etc.) are used unevenly. There are three major types of skew: CPU skew, AMP/data skew, and I/O skew.

- Data skew?
When data is not distributed equally across all the AMPs.
- CPU skew?
When CPU consumption is uneven, i.e. some AMPs (or queries) consume far more CPU than others.
- IO skew?
When some AMPs (or queries) perform far more I/O operations than others.
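
As an illustration, one common way to spot data skew is to compare each table's largest per-AMP footprint with its average in DBC.TableSize; the column aliases here are just for readability:

SELECT  databasename
       ,tablename
       ,SUM(currentperm)  AS total_perm
       ,MAX(currentperm)  AS max_amp_perm
       ,AVG(currentperm)  AS avg_amp_perm
       ,(1 - AVG(currentperm) / NULLIFZERO(MAX(currentperm))) * 100  AS skew_pct
FROM   dbc.tablesize
GROUP BY 1, 2
ORDER BY skew_pct DESC;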

What is a role?

A role is a set of access rights that can be granted to users. Roles indirectly help performance by reducing the number of rows entered in DBC.AccessRights.
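
As a quick sketch (the role, database, and user names here are invented), a role is created, granted rights, granted to users, and optionally made the default:

CREATE ROLE sales_read;
GRANT SELECT ON sales_db TO sales_read;               /* rights go to the role, not each user */
GRANT sales_read TO some_user;                        /* user picks up the role's rights      */
MODIFY USER some_user AS DEFAULT ROLE = sales_read;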

Friday, 18 October 2013

Teradata error : 7535: Operation not allowed: table header has invalid partitioning.

This can occur due to a restore/copy of the table to another system with different hardware or a different operating system. In this case, the partitioning expressions need to be regenerated and updated in the table header.

Example:

SELECT Failed. 7535: Operation not allowed: MTD_DB.MTD_MSTR table header has invalid partitioning. 

The solution is to revalidate the table header against the primary index of the table.
For that we can ALTER the TABLE with the REVALIDATE PRIMARY INDEX option.

Here we go... the command is:


alter table Mtd_tb.mtd_mstr revalidate primary index;

Get current session information in Teradata

HELP SESSION;

Query to find out biggest table in teradata

SELECT TOP 1 databasename, tablename, SUM(currentperm)
FROM dbc.tablesize
GROUP BY 1, 2
ORDER BY 3 DESC;

DBQL: Database Query Log

Database Query Log is an extremely useful tool to the Teradata DBA in terms of analysing resource usage as well as for security purposes.
Basically it is a log of queries submitted by users of the Teradata Database. It can record varying levels of detail, including CPU usage, I/O, elapsed time, and details of the objects accessed.

DBQL step data can help a DBA drill deeper into a problem query by showing which query step is causing the problem: skewing, product joining, or a large scan. This data, when matched up with the EXPLAIN plan, can provide the insight needed to tune the query.


Highlights
Allows you to log query processing activity for later analysis. SQL and other metrics are captured.

DBQL captures details and metrics for workload analysis and performance tuning purposes

DBQL comes with empty pre-defined tables.

The DBA sets up rules for what information DBQL should capture and then turns DBQL ON!

The empty pre-defined tables now begin to fill up with the information the DBA said to capture

The DBA can tell DBQL to capture SQL statements, what objects were referenced, thresholds and summary rules, Explain steps, query exceptions
defined in workloads, and other important metrics

The DBA has choices and can turn DBQL on or off, or capture only one person, an accounting group, a workload, or everyone. DBQL works great to help define or redefine workloads!
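
As a rough sketch (the user name is a placeholder and the exact WITH options vary by release), logging is typically switched on and off with statements like these, and the captured rows land in the DBQL tables/views such as DBC.QryLog:

/* Start logging SQL text, referenced objects and step detail for one user */
BEGIN QUERY LOGGING WITH SQL, OBJECTS, STEPINFO ON etl_user;

/* Or log everyone */
BEGIN QUERY LOGGING ON ALL;

/* Review what was captured */
SELECT username, starttime, ampcputime, totaliocount
FROM dbc.qrylog
ORDER BY ampcputime DESC;

/* Stop logging for that user */
END QUERY LOGGING ON etl_user;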

Purpose/Use of collect statistics

The following list is a very high-level description of the most important purposes for column 
and index statistics.

• The Optimizer uses statistics to decide whether it should generate a query plan that uses a 
secondary, hash, or join index instead of performing a full-table scan. 

• The Optimizer uses statistics to estimate the cardinalities of intermediate spool files based 
on the qualifying conditions specified by a query. 
The estimated cardinality of intermediate results is critical for the determination of both 
optimal join orders for tables and the kind of join method that should be used to make 
those joins. 
For example, should 2 tables or spool files be redistributed and then merge joined, or 
should one of the tables or spool files be duplicated and then product joined with the 
other? Depending on how accurate the statistics are, the generated join plan can vary so 
greatly that the same query can take only seconds to complete using one join plan, but take 
hours to complete using another. 

• For PPI tables, statistics collected on the PARTITION system-derived column permit the 
Optimizer to better estimate costs. 
Teradata Database also uses PARTITION statistics for estimates when predicates are based 
on the PARTITION column, for example: WHERE PARTITION IN (3,4). 

• The Teradata Statistics Wizard uses existing statistics, which it assumes are current by 
default, as a basis for recommending other statistics to be collected for a query or query 
workload (see “Special Considerations for Sampled and Stale Statistics” on page 668).
This is yet another reason you should keep your statistics as current as you possibly can.
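
For illustration (the database, table, and column names are made up), statistics are typically collected and refreshed with statements along these lines; the PARTITION form matches the PPI point above:

/* Column and index statistics */
COLLECT STATISTICS ON sales_db.sales_fact COLUMN (customer_id);
COLLECT STATISTICS ON sales_db.sales_fact INDEX (order_date);

/* System-derived PARTITION column on a PPI table */
COLLECT STATISTICS ON sales_db.sales_fact COLUMN PARTITION;

/* Re-collect everything already defined for the table */
COLLECT STATISTICS ON sales_db.sales_fact;

/* See what has been collected */
HELP STATISTICS sales_db.sales_fact;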

Most ETL systems load dimensions first and then facts. If a fact arrives first, how do you load it?

We just set the early-arriving fact as "unknown" and point it at an unknown dimension key (usually -1) in the fact table.

Another option we can use is the idea of an inferred dimension. As you load your fact table data you do a lookup to your dimension. 
If you have a match, cool, take that key and move on. If you don’t have a match, instead of setting the key to –1 (unknown), do this:

1) Insert a new dimension record with your business key from your fact table
2) Grab the newly created dimension key from the record you just inserted

3) Merge the key back into your fact data pipeline.
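
A minimal SQL sketch of this inferred-dimension pattern, assuming hypothetical stg_sales (staging fact), dim_customer (dimension with surrogate key customer_sk), and fact_sales tables:

/* 1) Insert an inferred ("unknown") dimension row for any new business key */
INSERT INTO dim_customer (customer_bk, customer_name, inferred_flag)
SELECT DISTINCT s.customer_bk, 'UNKNOWN', 'Y'
FROM   stg_sales s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dim_customer d
                   WHERE  d.customer_bk = s.customer_bk);

/* 2) and 3) Load the fact, picking up the surrogate key just created */
INSERT INTO fact_sales (customer_sk, order_amount)
SELECT d.customer_sk, s.order_amount
FROM   stg_sales s
JOIN   dim_customer d
  ON   d.customer_bk = s.customer_bk;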

Saturday, 12 October 2013

Volatile Temporary Table (VTT) vs Global Temporary Table (GTT)

A VTT is session scoped. A GTT is also session scoped, but its definition can be used by other sessions because the DDL is stored in the data dictionary.
A GTT takes up space from temporary space, whereas a VTT takes up space from spool.
A GTT's definition survives TD system restarts, whereas a VTT's does not.
A GTT allows indexes to be created, whereas a VTT does not.
In a single session, up to 2000 GTTs can be materialized, whereas for VTTs the limit is only 1000.
A GTT allows some constraints to be created, whereas a VTT does not.
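
A brief syntax sketch of both, with made-up table and column names (note ON COMMIT PRESERVE ROWS, without which rows disappear at the end of the transaction):

/* Volatile table: definition and rows live only in this session (spool) */
CREATE VOLATILE TABLE vt_orders
( order_id   INTEGER
, order_amt  DECIMAL(10,2)
) ON COMMIT PRESERVE ROWS;

/* Global temporary table: definition is in the dictionary, rows are
   materialized per session in temporary space */
CREATE GLOBAL TEMPORARY TABLE gt_orders
( order_id   INTEGER
, order_amt  DECIMAL(10,2)
) ON COMMIT PRESERVE ROWS;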

What is RAID, What are the types of RAID?

Redundant Array of Inexpensive Disks (RAID) is a type of protection available in Teradata. RAID provides data protection at the disk-drive level. It ensures data is available even when a disk drive has failed.

There are six standard levels of RAID (RAID 0 to RAID 5).
Teradata supports two levels of RAID protection
RAID 1 - Mirrored copy of data in other disk
RAID 5 - Parity bit (XOR) based Data protection on each disk array.

One of the major overheads of RAID is space consumption.

Saturday, 5 October 2013

What are the new features of Teradata 13.10 ?

Here are a few new features of Teradata 13.10:
Compression on VARCHAR data type fields
New calendar functions in TD 13.10
There are a lot of calendar functions introduced in 13.10, which are:
·         day_of_week
·         day_of_month
·         day_of_year
·         day_of_calendar
·         weekday_of_month
·         week_of_month
·         week_of_year
·         week_of_calendar
·         month_of_quarter
·         month_of_year
·         month_of_calendar
·         quarter_of_year
·         quarter_of_calendar
·         year_of_calendar
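
One way to see these calendar attributes is through the long-standing sys_calendar.calendar view, whose columns carry the same names; how the 13.10 functions themselves are invoked may differ by release, so treat this as illustrative:

SELECT calendar_date
      ,day_of_week
      ,week_of_year
      ,month_of_year
      ,quarter_of_year
FROM   sys_calendar.calendar
WHERE  calendar_date = DATE '2013-10-05';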

Compression
1. Multi valued Compression.

This is one of my favourite pieces of functionality in Teradata. Since its introduction in V2R5, it has gained popularity given the enormous space savings and reduced I/O in huge data warehouses.

To keep up its promise, Teradata has gone one step further and included more data types for compression.
Previously, multi-value compression was limited to fixed-length column types.

In TD 13.10, new features in MVC include:
- Support for VARCHAR and VARBYTE
- Longer column values, up to 510 characters (previously 255 in Teradata 12 and 13)
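
A small sketch of VARCHAR multi-value compression (the table, column, and value list are purely illustrative):

CREATE TABLE retail_db.store_visits
( visit_id  INTEGER
, city      VARCHAR(30)
    COMPRESS ('London', 'Paris', 'New York')  /* frequent values stored once in the table header */
, visit_dt  DATE
);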



2. Algorithmic compression
Users can now write their own compression and decompression algorithms for character data.
These have to be implemented as UDFs and need to be tested thoroughly before being rolled out at full scale.
There are some sample functions Teradata has provided, which can be found in SYSLIB:
TransUnicodeToUTF8, TransUTF8ToUnicode, LZCOMP, LZDECOMP, LZCOMP_L, LZDECOMP_L, CAMSET, DECAMSET, CAMSET_L, DECAMSET_L.
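
A hedged sketch of how a compress/decompress pair is attached to a Unicode column (the table and column names are invented, and the schema qualification of the functions may differ by site):

CREATE TABLE docs_db.documents
( doc_id   INTEGER
, doc_text VARCHAR(5000) CHARACTER SET UNICODE
    COMPRESS USING TransUnicodeToUTF8
    DECOMPRESS USING TransUTF8ToUnicode
);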



3. Block level compression (applicable to DBA’s, Capacity planners)
BLC is applied at the data-block level and allows row data to be stored in data blocks in a new compressed format.
To enable BLC on tables, a new command, COMPRESS, is available in the FERRET utility. After BLC is enabled on a table, all rows added to the table are compressed automatically. Similarly, the UNCOMPRESS command can be used to uncompress packed data.

BLC is one of the most straightforward methods, since one doesn't have to worry about the type of data across the various columns in a table. Row headers and internal Teradata structures are also compressed with BLC.

BLC plays a major role in saving a lot of I/O and space, but at the same time it can have performance implications, since more CPU may be used to compress and uncompress data dynamically.

Tuesday, 17 September 2013

POSITION and INDEX functions

The POSITION function returns the position of the first occurrence of a character. The POSITION function is ANSI standard.

Teradata has an equivalent function called INDEX.
Both the POSITION and INDEX functions return the position of a character's first occurrence in a string.

Examples for the POSITION function

SELECT POSITION( 'u' IN 'formula');  Displays Result as '5' 

Examples for the INDEX function

SELECT INDEX('formula', 'u');    Displays Result as '5'   

Monday, 16 September 2013

What is ACE in AB Initio?


What is residual condition in Teradata ?

Parse trees are always presented upside down, so query
execution begins with the lower cluster of operations and terminates with the upper cluster. In an EXPLAIN of a query, the expressions in the upper cluster would be referred to as residual conditions.

How do we handle DMLs that change dynamically?

There are many ways to handle DMLs that change dynamically within a single file. Some suitable methods are to use a conditional DML or to call the vector functionality while calling the DMLs.

What is AB_LOCAL expression where do you use it in ab-initio?

ablocal_expr is a parameter of the Input Table component in Ab Initio. ABLOCAL() is replaced by the contents of ablocal_expr, which we can make use of in parallel unloads. There are two forms of the ABLOCAL() construct: one with no arguments and one with a single argument, a table name (the driving table).
The ABLOCAL() construct is used because some complex SQL statements contain grammar that is not recognized by the Ab Initio parser when unloading in parallel. You can use ABLOCAL() in this case to prevent the Input Table component from parsing the SQL (it will get passed through to the database). It also specifies which table to use for the parallel clause.

How do you improve the performance of a graph?

There are many ways the performance of the graph can be improved. 
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join components and, if possible, replace them with in-memory joins/hash joins
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving port
8) For large datasets, don't use broadcast as the partitioner
9) Minimise the use of regular expression functions like re_index in transform functions
10) Avoid repartitioning of data unnecessarily