
Monday 21 October 2013

Adding Residual Conditions to a Join

Most of the examples in this book have included all rows from the tables being joined. However, in the world of Teradata, with millions of rows stored in a single table, additional comparisons are usually needed to reduce the number of rows returned. There are two ways to code residual conditions: as a compound condition in the ON clause, or in a WHERE clause added to the JOIN. These residual conditions are in addition to the join equality in the ON clause.
Consideration should be given to the type of join when including the WHERE clause. The following paragraphs discuss the operational aspects of mixing an ON with a WHERE for INNER and OUTER JOIN operations.

INNER JOIN

The WHERE clause works exactly the same when used with the INNER JOIN as it does on all other forms of the SELECT. It eliminates rows at read time based on the condition being checked and any index columns involved in the comparison.
Normally, as fewer rows are read, the faster the SQL will run. It is more efficient because fewer resources such as disk, I/O, cache space, spool space, and CPU are needed. Therefore, whenever possible, it is best to eliminate unneeded rows using a WHERE condition with an INNER JOIN. I like the use of WHERE because all residual conditions are located in one place.
The following samples are the same join that was performed earlier in this chapter. Here, one uses a WHERE clause and the other a compound comparison via the ON:
SELECT   Customer_name
          ,Order_number
          ,Order_total  (FORMAT  '$$$,$$9.99')
FROM  Customer_table AS cust   INNER JOIN  Order_table AS ord
        ON   cust.customer_number = ord.customer_number
WHERE Customer_name    LIKE 'Billy%'
ORDER BY 3 DESC;
Or
SELECT   Customer_name
         ,Order_number
         ,Order_total   (FORMAT  '$$$,$$9.99')
FROM  Customer_table AS cust   INNER JOIN  Order_table AS ord
      ON   cust.customer_number = ord.customer_number
              AND  Customer_name LIKE 'Billy%'
ORDER BY 3 DESC;
2 Rows Returned
Customer_name          Order_number   Order_total
Billy's Best Choice    123456         $12,347.53
Billy's Best Choice    123512          $8,005.91
The output is exactly the same with both coding methods. This can be verified using the EXPLAIN. We recommend using the WHERE clause with an inner join because it consolidates all residual conditions in a single location that is easy to find when changes are needed. Although there are multiple ON comparisons, there is only one WHERE clause.
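To see this for yourself, prefix either version with EXPLAIN and compare the step lists; a minimal sketch using the same tables as above:

EXPLAIN
SELECT   Customer_name
         ,Order_number
         ,Order_total   (FORMAT  '$$$,$$9.99')
FROM  Customer_table AS cust   INNER JOIN  Order_table AS ord
      ON   cust.customer_number = ord.customer_number
WHERE Customer_name LIKE 'Billy%'
ORDER BY 3 DESC;
-- The plan should match the plan produced by the ON ... AND version.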

Saturday 19 October 2013

How to identify PPI columns?

SELECT databasename, tablename, columnposition, columnname
FROM dbc.indices
WHERE indextype = 'Q'
ORDER BY 1, 2, 3;

How to skip the header row in the fastload script

RECORD 2;   /* this skips first record in the source file */
DEFINE ...

How many types of Skew exist?

Skew exists when Teradata resources (CPU, AMPs, I/O, disk, etc.) are utilized unequally. There are three major types of skew: CPU skew, AMP/data skew, and I/O skew.

-Data skew?
When data is not distributed equally across all the AMPs.
-CPU skew?
When some sessions or AMPs consume more CPU than others.
-IO skew?
When some sessions or AMPs perform more I/O operations than others, resulting in I/O skew.
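As an illustration, data/AMP skew for tables can be estimated from DBC.TableSize; a minimal sketch (the skew-factor formula below is one common convention, not a Teradata-defined column):

SELECT  DatabaseName
        ,TableName
        ,SUM(CurrentPerm)  AS TotalPerm
        ,MAX(CurrentPerm)  AS MaxAmpPerm
        ,AVG(CurrentPerm)  AS AvgAmpPerm
        /* 0 = perfectly even distribution; the closer to 100, the worse the data skew */
        ,100 - (AVG(CurrentPerm) / NULLIFZERO(MAX(CurrentPerm)) * 100)  AS SkewFactorPct
FROM DBC.TableSize
GROUP BY 1, 2
ORDER BY SkewFactorPct DESC;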

What is a role?

A role is a set of access rights that can be granted to users. Roles indirectly help performance by reducing the number of rows entered in DBC.AccessRights.
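A minimal sketch, with hypothetical database/user/role names, of how a role is typically set up:

CREATE ROLE sales_read;                               -- define the role once
GRANT SELECT ON sales_db TO sales_read;               -- grant rights to the role, not to every user
GRANT sales_read TO etl_user;                         -- give the role to a user
MODIFY USER etl_user AS DEFAULT ROLE = sales_read;    -- make it the user's default role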

Friday 18 October 2013

Teradata error : 7535: Operation not allowed: table header has invalid partitioning.

This can occur due to a restore/copy of the table to another system with different hardware or operating system. In this case, the partitioning expressions need to be regenerated and updated in the table headers of the table. 

Example:

SELECT Failed. 7535: Operation not allowed: MTD_DB.MTD_MSTR table header has invalid partitioning. 

The solution is to revalidate the table header against the table's primary index. For that we can ALTER the TABLE with the REVALIDATE PRIMARY INDEX option:

ALTER TABLE MTD_DB.MTD_MSTR REVALIDATE PRIMARY INDEX;

Get current session information in Teradata

HELP SESSION;

Query to find out biggest table in teradata

SELECT TOP 1 DATABASENAME, TABLENAME ,SUM(CURRENTPERM) FROM DBC.TABLESIZE  GROUP BY 1,2 ORDER BY 3 DESC;

DBQL: Database Query Log

Database Query Log is an extremely useful tool to the Teradata DBA in terms of analysing resource usage as well as for security purposes.
Basically it is a log of queries submitted by users of the Teradata Database. It can record varying levels of detail but includes CPU usage, I/O and elapsed time and details of the objects accessed.

DBQL Step data can help a DBA drill deeper into a problem query by showing which query step is causing the problem skewing, product joining or large scan. This data, when matched up with the EXPLAIN plan, can provide the insight needed to tune the query.


Highlights
Allows you to log query processing activity for later analysis. SQL and other metrics are captured.

DBQL captures details and metrics for workload analysis and performance tuning purposes

DBQL comes with empty pre-defined tables.

The DBA sets up rules to what information DBQL should capture and then turns DBQL ON !

The empty pre-defined tables now begin to fill up with the information the DBA said to capture

The DBA can tell DBQL to capture SQL statements, what objects were referenced, thresholds and summary rules, Explain steps, query exceptions
defined in workloads, and other important metrics

The DBA has choices: DBQL can be turned on or off, and capture can be limited to one user, an accounting group, a workload, or everyone. DBQL works great to help define or redefine workloads!
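For example, logging is switched on with BEGIN QUERY LOGGING and off with END QUERY LOGGING; a minimal sketch (the user name is hypothetical, and the WITH/LIMIT options shown are just one possible rule):

BEGIN QUERY LOGGING WITH SQL, OBJECTS LIMIT SQLTEXT=10000 ON ALL;   -- capture SQL text and referenced objects for everyone
BEGIN QUERY LOGGING WITH EXPLAIN ON etl_user;                       -- capture Explain steps for a single user
END QUERY LOGGING ON ALL;                                           -- turn the ALL rule off again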

Purpose/Use of collect statistics

The following list is a very high-level description of the most important purposes for column 
and index statistics.

• The Optimizer uses statistics to decide whether it should generate a query plan that uses a secondary, hash, or join index instead of performing a full-table scan. 

• The Optimizer uses statistics to estimate the cardinalities of intermediate spool files based 
on the qualifying conditions specified by a query. 
The estimated cardinality of intermediate results is critical for the determination of both 
optimal join orders for tables and the kind of join method that should be used to make 
those joins. 
For example, should two tables or spool files be redistributed and then merge joined, or should one of the tables or spool files be duplicated and then product joined with the other? Depending on how accurate the statistics are, the generated join plan can vary so greatly that the same query can take only seconds to complete using one join plan, but hours to complete using another. 

• For PPI tables, statistics collected on the PARTITION system-derived column permit the 
Optimizer to better estimate costs. 
Teradata Database also uses PARTITION statistics for estimates when predicates are based 
on the PARTITION column, for example: WHERE PARTITION IN (3,4). 

• The Teradata Statistics Wizard uses existing statistics, which it assumes are current by 
default, as a basis for recommending other statistics to be collected for a query or query 
workload (see “Special Considerations for Sampled and Stale Statistics” on page 668).
This is yet another reason you should keep your statistics as current as you possibly can.
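For illustration, using the hypothetical tables from the join examples earlier in this blog (and assuming Order_table is a PPI table):

COLLECT STATISTICS ON Customer_table COLUMN customer_number;   -- single-column stats on the join column
COLLECT STATISTICS ON Order_table COLUMN PARTITION;            -- PARTITION stats for a PPI table
HELP STATISTICS Customer_table;                                -- review what has been collected and when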

Most ETL systems load the dimensions first and then the facts. If a fact arrives first, how do you load it?

We just set the early-arriving fact to an "unknown" dimension key (usually –1) in the fact table. 

Another option we can use is the idea of an inferred dimension. As you load your fact table data you do a lookup to your dimension. 
If you have a match, cool, take that key and move on. If you don’t have a match, instead of setting the key to –1 (unknown), do this:

1) Insert a new dimension record with your business key from your fact table
2) Grab the newly created dimension key from the record you just inserted

3) Merge the key back into your fact data pipeline.
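A minimal SQL sketch of the inferred-dimension idea, with hypothetical staging, dimension, and fact tables (customer_sk is assumed to be generated automatically, e.g. by an identity column):

/* 1) Insert a placeholder dimension row for any business key that only exists in the fact feed */
INSERT INTO customer_dim (customer_bk, customer_name, is_inferred)
SELECT DISTINCT s.customer_bk, 'UNKNOWN', 'Y'
FROM   fact_stage s
LEFT JOIN customer_dim d ON s.customer_bk = d.customer_bk
WHERE  d.customer_bk IS NULL;

/* 2) and 3) Look the surrogate key back up while loading the fact table */
INSERT INTO sales_fact (customer_sk, order_number, order_total)
SELECT d.customer_sk, s.order_number, s.order_total
FROM   fact_stage s
JOIN   customer_dim d ON s.customer_bk = d.customer_bk;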

Saturday 12 October 2013

Volatile Temporary Table (VTT) vs Global Temporary Table (GTT)

A VTT is session-scoped. A GTT's contents are also session-scoped, but its DDL is stored in the data dictionary, so the definition can be accessed by other sessions.
A GTT takes up space from temporary space, whereas a VTT takes up spool space.
A GTT definition survives a TD system restart, whereas a VTT does not.
A GTT allows indexes to be created, whereas a VTT does not.
In a single session, 2000 GTTs can be materialized, whereas for VTTs the count is only 1000.
A GTT allows some constraints to be created, whereas a VTT does not.
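For illustration, the two are created differently (table and column names are hypothetical):

CREATE VOLATILE TABLE vt_orders          -- definition exists only in this session, uses spool
( order_number INTEGER
 ,order_total  DECIMAL(10,2)
) ON COMMIT PRESERVE ROWS;

CREATE GLOBAL TEMPORARY TABLE gt_orders  -- definition is stored in the data dictionary, uses temp space
( order_number INTEGER
 ,order_total  DECIMAL(10,2)
) ON COMMIT PRESERVE ROWS;               -- each session materializes its own private copy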

What is RAID, What are the types of RAID?

Redundant Array of Inexpensive Disks (RAID) is a type of protection available in Teradata. RAID provides data protection at the disk-drive level. It ensures data is available even when a disk drive has failed.

There are six standard levels of RAID (RAID 0 to RAID 5). 
Teradata supports two levels of RAID protection
RAID 1 - Mirrored copy of data in other disk
RAID 5 - Parity bit (XOR) based Data protection on each disk array.

One of the major overheads of RAID is space consumption.

Saturday 5 October 2013

What are the new features of Teradata 13.10 ?

Here are a few new features of Teradata 13.10:
Compression on VARCHAR data type fields
New calendar functions in TD 13.10
A lot of calendar functions were introduced in 13.10:
·         day_of_week
·         day_of_month
·         day_of_year
·         day_of_calendar
·         weekday_of_month
·         week_of_month
·         week_of_year
·         week_of_calendar
·         month_of_quarter
·         month_of_year
·         month_of_calendar
·         quarter_of_year
·         quarter_of_calendar
·         year_of_calendar
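These names also exist as columns of the sys_calendar.calendar view, so a quick way to see the values for a given date is, for example:

SELECT  calendar_date
        ,day_of_week
        ,week_of_year
        ,month_of_year
        ,quarter_of_year
FROM sys_calendar.calendar
WHERE calendar_date = CURRENT_DATE;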

Compression
1. Multi valued Compression.

This is one of my favourite pieces of functionality in Teradata. Since its introduction in V2R5, it has gained popularity considering the enormous space savings and reduced I/O in huge data warehouses.

To keep up its promise, Teradata has gone one step further to include more data types for compression.
Previously, multi-value compression was limited to fixed-length column types.

In TD 13.10, some new features in MVC include:
- Support for VARCHAR and VARBYTE
- Support for longer columns, up to 510 characters (previously 255 in Teradata 12 and 13)
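A minimal sketch of multi-value compression on a VARCHAR column (the table and the compressed values are hypothetical):

CREATE TABLE order_status_demo
( order_number INTEGER
 ,order_status VARCHAR(20)
    COMPRESS ('OPEN', 'CLOSED', 'SHIPPED')   -- 13.10 allows MVC on VARCHAR columns
)
PRIMARY INDEX (order_number);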



2. Algorithmic compression
Users can now write their own compression and decompression algorithms for character data.
These have to be implemented as UDFs and need to be tested thoroughly before being rolled out at full scale.
Teradata provides some sample functions, which can be found in SYSLIB:
TransUnicodeToUTF8, TransUTF8ToUnicode, LZCOMP, LZDECOMP, LZCOMP_L, LZDECOMP_L, CAMSET, DECAMSET, CAMSET_L, DECAMSET_L.
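For example, the Unicode sample functions can be attached to a column with COMPRESS USING / DECOMPRESS USING; a minimal sketch with a hypothetical table:

CREATE TABLE notes_demo
( note_id   INTEGER
 ,note_text VARCHAR(1000) CHARACTER SET UNICODE
    COMPRESS USING TransUnicodeToUTF8        -- algorithmic compression via the SYSLIB UDF
    DECOMPRESS USING TransUTF8ToUnicode
)
PRIMARY INDEX (note_id);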



3. Block level compression (applicable to DBAs and capacity planners)
BLC is applied at the data-block level and allows row data to be stored in data blocks in a new compressed format.
To enable BLC on tables, a new command, COMPRESS, is available in the FERRET utility. After BLC is enabled on a table, all rows added to the table are compressed automatically. Similarly, the UNCOMPRESS command can be used to uncompress packed data.

BLC is one of the most straightforward methods because one doesn't have to worry about the type of data across the various columns in a table. Row headers and internal Teradata structures are also compressed by BLC.

BLC plays a major role in saving a lot of I/O and space, but at the same time it can have some performance implications, as more CPU can be used to compress and uncompress data dynamically. 

Tuesday 17 September 2013

POSITION and INDEX functions

The POSITION function returns the position of the first occurrence of a character in a string. The POSITION function is ANSI standard.

Teradata has an equivalent function called INDEX.
Both the POSITION and INDEX functions return the position of a character's first occurrence in a string.

Examples for the POSITION function

SELECT POSITION( 'u' IN 'formula');  Displays Result as '5' 

Examples for the INDEX function

SELECT INDEX('formula', 'u');    Displays Result as '5'   

Monday 16 September 2013

What is ACE in AB Initio?


What is residual condition in Teradata ?

Parse trees are always presented upside down, so query execution begins with the lower cluster of operations and terminates with the upper cluster. In an EXPLAIN of a query, the expressions in the upper cluster are referred to as residual conditions.

How do we handle if DML changing dynamically

There are many ways to handle DMLs that change dynamically within a single file. Some suitable methods are to use a conditional DML or to call the vector functionality while calling the DMLs.

What is AB_LOCAL expression where do you use it in ab-initio?

ablocal_expr is a parameter of the Table component of Ab Initio. ABLOCAL() is replaced by the contents of ablocal_expr, which we can make use of in parallel unloads. There are two forms of the ABLOCAL() construct: one with no arguments and one with a single argument, a table name (the driving table).
Some complex SQL statements contain grammar that is not recognized by the Ab Initio parser when unloading in parallel. You can use the ABLOCAL() construct in this case to prevent the Input Table component from parsing the SQL (it will get passed through to the database). It also specifies which table to use for the parallel clause.

How do you improve the performance of a graph?

There are many ways the performance of the graph can be improved. 
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving port
8) For large dataset don't use broadcast as partitioner
9) Minimise the use of regular expression functions like re_index in the transform functions
10) Avoid repartitioning of data unnecessarily

What is the difference between a DB config and a CFG file?

A .dbc file has the information required for Ab Initio to connect to the database to extract or load tables or views, while a .cfg file is the table configuration file created by db_config when using components like Load DB Table.

What is the function you would use to transfer a string into a decimal?

    In this case no specific function is required if the size of the string and the decimal are the same. Just use a decimal cast with the size in the transform function and it will suffice. For example, if the source field is defined as string(8) and the destination as decimal(8) (say the field name is field1):

out.field :: (decimal(8)) in.field

If the destination field size is smaller than the input, then the string_substring function can be used, like the following (say the destination field is decimal(5)):

out.field :: (decimal(5))string_lrtrim(string_substring(in.field,1,5)) /* string_lrtrim used to trim leading and trailing spaces */ 

What would be the time out value for the AbInitio process?

You can increase time-out values with the AbInitio environment variables AB_STARTUP_TIMEOUT and AB_RTEL_TIMEOUT_SECONDS


There are two Ab Initio environment variables that control time-out values. Increasing these values may help if a job fails with a connection time-out message, although there may be some problem causing the time-out, such as a machine being down or an incorrect name being used, in which case increasing the time-out interval will not help.
AB_STARTUP_TIMEOUT specifies the number of seconds before timing out when starting processes on a processing node.
To start a job on a processing node, the Co>Operating System uses rsh, rexec, rlogin, telnet, ssh, or dcom. In most cases, these succeed or fail within a few seconds. However, if the processing node is heavily loaded, startup may take significantly longer. Increasing the time-out value gives the processing node more time to respond.

AB_RTEL_TIMEOUT_SECONDS controls how long to wait for the remote rlogin or telnet server to respond. 

What are the project parameters?

Project parameters specify various aspects of project behavior. Each parameter has a name and a string value. When a project is created, it comes with a set of default project parameters that set up a correspondence between file system URLs and locations within a datastore project.

Default Project Parameters

The default project parameters are the set of parameters that come with a project.
The default project location parameter is PROJECT_DIR, which represents the location of a project's top-level directory in a datastore, for example, /Projects/warehouse. You specify the name of the location parameter when you create a project. PROJECT_DIR cannot be edited in the parameters editors. You can edit the parameter name using the air project modify command (see the Guide to Managing Technical Metadata for details).
The other default parameters (see the table below) refer to PROJECT_DIR in their values through $ substitution. For example, DML has a default value of $PROJECT_DIR/dml. These default parameters represent the locations of various directories in the project, and you can edit them in the Parameters Editors.
Table 1
Parameter name   Represents the location of the directory that:
DML              Stores record format files
XFR              Stores transform files
PWD              The internal system uses to translate relative paths to absolute paths
RUN              The graphs of the project execute in
DB               Stores database interface files
You can reference project parameters from the graphs in the project, and the components in them, using $ substitution.

You can view and edit project parameters or add new parameters directly by using the Project  Parameters Editor, or indirectly by using the Sandbox Parameters Editor and then checking the project in to the datastore. The latter method is strongly recommended. 

What are the graph parameters?

Graph parameters are associated with individual graphs and are private to them. They affect the execution only of the graph for which they are defined. All the specifiable values of a graph, including each component's parameters (as well as other values such as URLs, file protections, and record formats) comprise that graph's parameters.
For example, the Layout parameter of the A-Transactions input file component in the tutorial Join Customers graph (see Lesson 3: Modifying graphs and seeing changes in the browser) is a graph parameter. Its value is the URL of the A-Transactions input file. Graph parameters are part of the graph they are associated with and are checked in or out along with the graph. 

Saturday 14 September 2013

I am loading 1 lakh records into table1 using MLOAD, and from the same table I am deleting 1 lakh records using MLOAD. Which script is faster: the MLOAD import or the MLOAD delete?

Definitely the MLOAD delete, because it uses only 4 phases (the acquisition phase is not used) and does not use any transient journal. 

What is deadlock ?

A deadlock occurs when transaction 1 places a lock on resource A, and then needs to lock resource B. But resource B has already been locked by transaction 2, which in turn needs to place a lock on resource A. This state of affairs is called a deadlock or a deadly embrace. To resolve a deadlock, Teradata Database aborts one of the transactions and performs a rollback.

Note: For example, a statement in BTEQ ends with a semicolon (;) as the last non-blank character in the line.
Thus, BTEQ sees the following example as two requests:
            sel * from table1;
            sel * from table2;
However, if you write these same statements in the following way, BTEQ sees them as only one request:
 sel * from table1
; sel * from table2;

What are different types of Locks available in Teradata ?

There are 4 types of locks in the Teradata Database, which are described below.
Access Lock:-
The use of an access lock allows data to be read while modifications are in process. Access locks are designed for decision support on tables that are updated only by small, single-row changes. An access lock is not concerned with data consistency. Access locks prevent other users from obtaining an Exclusive lock on the locked data.
Read Lock:-
Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data; during this time no data modification is permitted. Read locks prevent other users from obtaining Exclusive locks and Write locks on the locked data.
Write Lock:-
Write locks enable users to modify data while maintaining data consistency. While the data has a write lock on it, other users can only obtain an access lock. During this time, all other locks are held in a queue until the write lock is released.
Exclusive Lock:-
Exclusive locks are applied to databases or tables, not to rows. When an exclusive lock is applied, no other user can access the database or table. Exclusive locks are used when a DDL command is executed. An exclusive lock on a database or table prevents other users from obtaining any lock on the locked object.
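Lock requests can also be influenced explicitly with the LOCKING modifier; for example, a minimal sketch reusing the Order_table from the earlier join examples:

LOCKING TABLE Order_table FOR ACCESS    -- "dirty read": do not block, and do not be blocked by, writers
SELECT order_number, order_total
FROM Order_table;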

What is "checksum" in Teradata?

It basically specifies the percentage of data that should be used to compute the CHECKSUM for the table.
It is used for detecting data corruption (bad blocks).
A larger sample percentage means more reliability in detecting errors, but computing the CHECKSUM over more data consumes more CPU cycles and takes more time.

Problems in a disk drive or disk array can corrupt data. This type of corruption cannot be found easily, but queries against the corrupted data will give you wrong answers. The corruption can be found by means of the SCANDISK and CHECKTABLE utilities.
These errors, known as disk I/O errors, reduce the availability of the data warehouse. To guard against them, Teradata provides the disk I/O integrity check, implemented as a checksum at the table level. It is a protection technique with which we can select various levels of corruption checking, carried out by integrity methods.
This feature detects and logs disk I/O errors.

Teradata provides predefined data-integrity check levels: DEFAULT, NONE, LOW, MEDIUM, HIGH, and so on.

The checksum can be enabled at the table level in the CREATE TABLE DDL.

For the system level, use the DBSControl utility to set the parameter.

If you want more hands-on checking, you have to use the SCANDISK and CHECKTABLE utilities.

Run the CHECKTABLE utility at level 3 so that it diagnoses every row, byte by byte.
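At the table level, the checksum level is simply one of the table options; a minimal sketch with a hypothetical table (integrity levels as named in the 13.x releases discussed here):

CREATE TABLE chk_demo, CHECKSUM = HIGH   -- sample more of each data block when computing the checksum
( id  INTEGER
 ,val VARCHAR(50)
)
PRIMARY INDEX (id);

ALTER TABLE chk_demo, CHECKSUM = DEFAULT IMMEDIATE;   -- fall back to the system-level setting, including existing blocks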

Thursday 12 September 2013

Metadata

Introduction

In this electronic age where digital information is being created at a fantastic rate, tools are necessary to locate, access, manage, and understand it all—and that's where metadata comes in. A common definition of the term is "data about data." Metadata can serve many functions in data administration, including detailing the data's history, conditions for use, format, and management requirements. The Minnesota State Archives' interest in metadata stems from its mandate to identify, collect, preserve, and make available the historically valuable records of government regardless of format.

Summary


Metadata is usually defined as "data about data." Metadata allows users to locate and evaluate data without each person having to discover it anew with every use. Its basic elements are a structured format and a controlled vocabulary, which together allow for a precise and comprehensible description of content, location, and value.

While the term itself might sound new and trendy, the concept it describes is not. In some fashion, metadata has always been with us, apparent in everything from program listings in TV Guide to the nutritional information on the back of a cereal box. For government records, the familiar forms of metadata are the recordkeeping metadata standard and the records retention schedule.

Anyone who has suffered the exercise in irrelevance offered by an Internet search engine will appreciate the value of precise metadata. Because information in a digital format is only legible through the use of intermediary hardware and software, the role of metadata in information technology is fundamentally important. In any system, given the volume of information it contains, the uses to which it can be put, and the costs involved, metadata is the basic tool for efficiency and effectiveness.

Whatever you want to do with the information (e.g., protect its confidentiality, present it as evidence, provide citizens access to it, broadcast it, share it, preserve it, destroy it) will be feasible only if you and your partners can understand and rely upon the metadata describing it. Using metadata effectively means understanding and applying the standards appropriate to your needs.

Metadata Functions
Government agencies routinely use metadata to fulfill a variety of functions, but the primary uses are for:

  • Legal and statutory reasons (e.g., to satisfy records management laws and the rules of evidence)
  • Technological reasons (e.g., to design and document systems)
  • Operational or administrative reasons (e.g., to document decisions and establish accountability)
  • Service to citizens, agency staff, and others (e.g., to locate and share information)
In all of these cases, metadata standards will be effective only if they rely on a structured format and controlled vocabulary. "Structured format" means the metadata is defined in terms of specific, standardized elements or fields. For example, a library catalog entry for a book will identify its author, title, subject(s), and location, among other things. Unless all the elements are there, users will not be able to evaluate the metadata; they won't be able to answer the question "Is this the book I want?"
"Controlled vocabulary" means that there is a standard as well for the content of the elements. For example, the nutritional information on the back of a box of cereal is often defined in terms of weight per serving. We know what “sugar: three grams” means. It refers to a standard unit of measurement that allows us to compare the sugar content of one cereal to that of another. But if the box read "just the way you like it" or "pretty sweet," that would mean different things to different people. We couldn't compare a subjective review like that to what's on the back of another box of cereal.
To work effectively, the elements and components of metadata should have an accepted, precise meaning that reflects a common understanding among its creators and its users. That allows for evaluation and comparison, for selecting the information you want from all the information available.
Metadata and Information Technology
Metadata is useful for the management of information in any storage format, paper or digital. But it is critically important for information in a digital format because that is only legible through the use of intermediary hardware and software. We can open up a book or even hold microfilm up to a light to determine what it says. But we can't just look at a CD and say what's on it. We cannot possibly hope to locate, evaluate, or use all the files on a single PC, let alone the Internet, without metadata.

If information technology makes metadata necessary, it's information technology that makes metadata useful. Special software applications, such as TagGen, make the creation of standardized metadata simpler. Databases store and provide access to metadata. Most software applications automatically create metadata and associate it with files. One example is the header and routing information that accompany an e-mail message. Another is the set of properties created with every Microsoft Word document; certain elements such as the title, author, file size, etc., are automatically created, but other elements can be customized and created manually. Normally, some combination of automatically and manually created information is best for precise and practical metadata.
Most important, metadata can inform business rules and software code that transform it into "executable knowledge." For example, metadata can be used for batch processing of files. A date element is critical to records management, as most record retention schedules are keyed to a record's date of creation. Metadata in more sophisticated data formats, such as eXtensible Markup Language (XML), allows for extraction, use, and calculation based on specific components of a metadata record.

By Ralph Kimball


Metadata is an amazing topic in the data warehouse world. Considering that we don’t know exactly what it is, or where it is, we spend more time talking about it, worrying about it, and feeling guilty we aren’t doing anything about it than any other topic. Several years ago we decided that metadata is any data about data. This wasn’t very helpful because it didn’t paint a clear picture in our minds. This fuzzy view gradually cleared up, and recently we have been talking more confidently about the "back-room metadata" that guides the extraction, cleaning, and loading processes, as well as the "front-room metadata" that makes our query tools and report writers function smoothly.
The back-room metadata presumably helps the DBA bring the data into the warehouse and is probably also of interest to business users when they ask from where the data came. The front-room metadata is mostly for the benefit of the end user, and its definition has been expanded not only to include the oil that makes our tools function smoothly, but also a kind of dictionary of business content represented by all the data elements.
Even these definitions, as helpful as they are, fail to give the data warehouse manager much of a feeling for what it is he or she is supposed to do. It sounds like whatever this metadata stuff is, it’s important, and we better:
  • Make a nice annotated list of all of it.
  • Decide just how important each part is.
  • Take responsibility for it.
  • Decide what constitutes a consistent and working set of it.
  • Decide whether to make it or buy it.
  • Store it somewhere for backup and recovery.
  • Make it available to the people who need it.
  • Assure its quality and make it complete and up to date.
  • Control it from one place.
  • Document all of these responsibilities well enough to hand this job off (soon).
Now there is a good, solid IT set of responsibilities. So far, so good. The only trouble is, we haven’t really said what it is yet. We do notice that the last item in the above list really isn’t metadata, but rather, data about metadata. With a sinking feeling, we realize we probably need meta meta data data.
To get this under control, let’s try to make a complete list of all possible types of metadata. We surely won’t succeed in this first try, but we will learn a lot. First, let’s go to the source systems, which could be mainframes, separate nonmainframe servers, users’ desktops, third-party data providers, or even online sources. We will assume that all we do here is read the source data and extract it to a data staging area that could be on the mainframe or could be on a downstream machine. Taking a big swig of coffee, we start the list:
  • Repository specifications
  • Source schemas
  • Copy-book specifications
  • Proprietary or third-party source specifications
  • Print spool file source specifications
  • Old format specifications for archived mainframe data
  • Relational, spreadsheet, and Lotus Notes source specifications
  • Presentation graphics source specifications (for example, Powerpoint)
  • URL source specifications
  • Ownership descriptions of each source
  • Business descriptions of each source
  • Update frequencies of original sources
  • Legal limitations on the use of each source
  • Mainframe or source system job schedules
  • Access methods, access rights, privileges, and passwords for source access
  • The Cobol/JCL, C, or Basic to implement extraction
  • The automated extract tool settings, if we use such a tool
  • Results of specific extract jobs including exact times, content, and completeness.
Now let’s list all the metadata needed to get the data into a data staging area and prepare it for loading into one or more data marts. We may do this on the mainframe with hand-coded Cobol, or by using an automated extract tool. Or we may bring the flat file extracts more or less untouched into a separate data staging area on a different machine. In any case, we have to be concerned about metadata describing:
  • Data transmission scheduling and results of specific transmissions
  • File usage in the data staging area including duration, volatility, and ownership
  • Definitions of conformed dimensions and conformed facts
  • Job specifications for joining sources, stripping out fields, and looking up attributes
  • Slowly changing dimension policies for each incoming descriptive attribute (for example, overwrite, create new record, or create new field)
  • Current surrogate key assignments for each production key, including a fast lookup table to perform this mapping in memory
  • Yesterday’s copy of a production dimension to use as the basis for Diff Compare
  • Data cleaning specifications
  • Data enhancement and mapping transformations (for example, expanding abbreviations and providing more detail)
  • Transformations required for data mining (for example, interpreting nulls and scaling numerics)
  • Target schema designs, source to target data flows, target data ownership, and DBMS load scripts
  • Aggregate definitions
  • Aggregate usage statistics, base table usage statistics, potential aggregates
  • Aggregate modification logs
  • Data lineage and audit records (where exactly did this record come from and when)
  • Data transform run-time logs, success summaries, and time stamps
  • Data transform software version numbers
  • Business descriptions of extract processing
  • Security settings for extract files, software, and metadata
  • Security settings for data transmission (that is, passwords, certificates, and so on)
  • Data staging area archive logs and recovery procedures
  • Data staging-archive security settings.
Once we have finally transferred the data to the data mart DBMS, then we must have metadata, including:
  • DBMS system tables
  • Partition settings
  • Indexes
  • Disk striping specifications
  • Processing hints
  • DBMS-level security privileges and grants
  • View definitions
  • Stored procedures and SQL administrative scripts
  • DBMS backup status, procedures, and security.
In the front room, we have metadata extending to the horizon, including:
  • Precanned query and report definitions
  • Join specification tool settings
  • Pretty print tool specifications (for relabeling fields in readable ways)
  • End-user documentation and training aids, both vendor supplied and IT supplied
  • Network security user privilege profiles, authentication certificates, and usage statistics, including logon attempts, access attempts, and user ID by location reports
  • Individual user profiles, with links to human resources to track promotions, transfers, and resignations that affect access rights
  • Links to contractor and partner tracking where access rights are affected
  • Usage and access maps for data elements, tables, views, and reports
  • Resource charge back statistics
  • Favorite Web sites (as a paradigm for all data warehouse access).
Now we can see why we didn’t know what this metadata was all about. It is everything! Except for the data itself. Suddenly, the data seems like the simplest part.
With this perspective, do we really need to keep track of all this? We do, in my opinion. This list of metadata is the essential framework of your data warehouse. Just listing it as we have done seems quite helpful. It’s a long list, but we can go down through it, find each kind of metadata, and identify what it is used for and where it is stored.
There are some sobering realizations, however. Much of this metadata needs to reside on the machines close to where the work occurs. Programs, settings, and specifications that drive processes have to be in certain destination locations and in very specific formats. That isn’t likely to change soon.

Once we have taken the first step of getting our metadata corralled and under control, can we hope for tools that will pull all the metadata together in one place and be able to read and write it as well? With such a tool, not only would we have a uniform user interface for all this disparate metadata, but on a consistent basis we would be able to snapshot all the metadata at once, back it up, secure it, and restore it if we ever lost it.

Don’t hold your breath. As you can appreciate, this is a very hard problem, and encompassing all forms of metadata will require a kind of systems integration that we don’t have today. I believe the Metadata Coalition (a group of vendors trying seriously to solve the metadata problem) will make some reasonable progress in defining common syntax and semantics for metadata, but it has been two years and counting since they started this effort. Unfortunately, Oracle, the biggest DBMS player, has chosen to sit out this effort and has promised to release its own proprietary metadata standard. Other vendors are making serious efforts to extend their product suites to encompass many of the activities listed in this article and simultaneously to publish their own framework for metadata. These vendors include Microsoft, who’s working with the Metadata Coalition to extend the Microsoft Repository, as well as a pack of aggressive, smaller players proposing comprehensive metadata frameworks, including Sagent, Informatica, VMark, and D2K. In any case, these vendors will have to offer significant business advantages in order to compel other vendors to write to their specifications.