Introduction

In this electronic age where digital information is being created at a fantastic rate, tools are necessary to locate, access, manage, and understand it all—and that's where metadata comes in. A common definition of the term is "data about data." Metadata can serve many functions in data administration, including detailing the data's history, conditions for use, format, and management requirements. The Minnesota State Archives' interest in metadata stems from its mandate to identify, collect, preserve, and make available the historically valuable records of government regardless of format.

Summary

Metadata is usually defined as "data about data": information (e.g., creator name, creation date) that is used to facilitate intellectual control of, and structured access to, other information. Metadata allows users to locate and evaluate data without each person having to discover it anew with every use. Its basic elements are a structured format and a controlled vocabulary, which together allow for a precise and comprehensible description of content, location, and value.

While the term itself might sound new and trendy, the concept it describes is not. In some fashion, metadata has always been with us, apparent in everything from program listings in TV Guide to the nutritional information on the back of a cereal box. For government records, the familiar forms of metadata are the recordkeeping metadata standard and the records retention schedule.

According to the State of Minnesota, a record is an item that documents an official government transaction or action. Two statutory definitions apply:

"All cards, correspondence, disks, maps, memoranda, microfilm, papers, photographs, recordings, reports, tapes, writings and other data, information or documentary material, regardless of physical form or characteristics, storage media or condition of use, made or received by an officer or agency of the state and an officer or agency of a county, city, town, school district, municipal subdivision or corporation or other public authority or political entity within the state pursuant to state law or in connection with the transaction of public business by an officer or agency…. The term 'records' excludes data and information that does not become part of an official transaction, library and museum material made or acquired and kept solely for reference or exhibit purpose, extra copies of documents kept only for convenience of reference and stock of publications and process documents, and bond, coupons, or other obligations or evidence of indebtedness, the destruction or other disposition of which is governed by other laws" (Minnesota Statutes, section 138.17, subd. 1).

"Information that is inscribed on a tangible medium or that is stored in an electronic or other medium and is retrievable in perceivable form" (Minnesota Statutes, section 325L.02).

Anyone who has suffered the exercise in irrelevance offered by an Internet search engine will appreciate the value of precise metadata. Because information in a digital format is only legible through the use of intermediary hardware and software, the role of metadata in information technology is fundamentally important. (Minnesota Statutes, section 325L.02, defines information as "data, text, images, sounds, codes, computer programs, software, databases, or the like.") In any system, given the volume of information it contains, the uses to which it can be put, and the costs involved, metadata is the basic tool for efficiency and effectiveness.

Whatever you want to do with the information (e.g., protect its confidentiality, present it as evidence, provide citizens access to it, broadcast it, share it, preserve it, destroy it) will be feasible only if you and your partners can understand and rely upon the metadata describing it. Using metadata effectively means understanding and applying the standards appropriate to your needs.

Metadata Functions

Government agencies routinely use metadata to fulfill a variety of functions, but the primary uses are for:

Legal and statutory reasons (e.g., to satisfy records management laws and the rules of evidence)
Technological reasons (e.g., to design and document systems)
Operational or administrative reasons (e.g., to document decisions and establish accountability)
Service to citizens, agency staff, and others (e.g., to locate and share information)

In all of these cases, metadata standards will be effective only if they rely on a structured format and controlled vocabulary. "Structured format" means the metadata is defined in terms of specific, standardized elements or fields. For example, a library catalog entry for a book will identify its author, title, subject(s), and location, among other things. Unless all the elements are there, users will not be able to evaluate the metadata; they won't be able to answer the question "Is this the book I want?"
"Controlled vocabulary" means that there is a standard as well for the content of the elements. For example, the nutritional information on the back of a box of cereal is often defined in terms of weight per serving. We know what “sugar: three grams” means. It refers to a standard unit of measurement that allows us to compare the sugar content of one cereal to that of another. But if the box read "just the way you like it" or "pretty sweet," that would mean different things to different people. We couldn't compare a subjective review like that to what's on the back of another box of cereal.
To work effectively, the elements and components of metadata should have an accepted, precise meaning that reflects a common understanding among its creators and its users. That allows for evaluation and comparison, for selecting the information you want from all the information available.
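The cereal-box example can be made concrete with a short sketch. It is only an illustration under stated assumptions: the class name `NutritionEntry`, the set `ALLOWED_UNITS`, and the helper functions are hypothetical and do not come from any metadata standard. The fixed fields stand in for the structured format, and the unit check stands in for the controlled vocabulary.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary: the only units an entry may use.
ALLOWED_UNITS = {"g", "mg"}


@dataclass(frozen=True)
class NutritionEntry:
    """Structured format: every entry carries the same three named elements."""
    nutrient: str
    amount: float
    unit: str

    def __post_init__(self):
        # Controlled vocabulary: reject free-text values such as "pretty sweet".
        if self.unit not in ALLOWED_UNITS:
            raise ValueError(f"unit {self.unit!r} is not in the controlled vocabulary")


def to_grams(entry):
    """Convert an entry to grams so that two entries can be compared."""
    return entry.amount / 1000 if entry.unit == "mg" else entry.amount


def sweeter(a, b):
    """Return whichever entry records more of the nutrient per serving."""
    return a if to_grams(a) >= to_grams(b) else b
```

Because both entries use units from the same small vocabulary, `sweeter` can compare them directly; an entry whose unit is "just the way you like it" is rejected when it is created, which is the point of the controlled vocabulary.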
Metadata and Information Technology

Metadata is useful for the management of information in any storage format, paper or digital. But it is critically important for information in a digital format because such information is legible only through the use of intermediary hardware and software. We can open up a book or even hold microfilm up to a light to determine what it says. But we can't just look at a CD and say what's on it. We cannot possibly hope to locate, evaluate, or use all the files on a single PC, let alone the Internet, without metadata.
If information technology makes metadata necessary, it's information technology that makes metadata useful. Special software applications, such as TagGen, make the creation of standardized metadata simpler. Databases store and provide access to metadata. Most software applications automatically create metadata and associate it with files. One example is the header and routing information that accompany an e-mail message. Another is the set of properties created with every Microsoft Word document; certain elements such as the title, author, file size, etc., are automatically created, but other elements can be customized and created manually. Normally, some combination of automatically and manually created information is best for precise and practical metadata.
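As a small illustration of automatically created metadata, the sketch below reads the headers of an e-mail message with Python's standard `email` package. The message text, addresses, and subject line are invented for the example; only the header mechanism itself reflects how e-mail works.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Invented sample message; the header block is the metadata, the body is the data.
RAW_MESSAGE = (
    "From: clerk@example.org\n"
    "To: archives@example.org\n"
    "Subject: Retention schedule update\n"
    "Date: Thu, 12 Sep 2013 09:30:00 -0500\n"
    "\n"
    "Body text.\n"
)

msg = message_from_string(RAW_MESSAGE)

# Collect every header as a name/value pair.
header_metadata = dict(msg.items())

# The Date header can be turned into a real datetime for sorting or retention rules.
sent = parsedate_to_datetime(msg["Date"])
```

None of these headers had to be typed by the sender as a description of the message; the mail software attached them, which is why they make a reliable starting point for manually added elements.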
Most important, metadata can inform business rules and software code that transforms it into "executable knowledge." For example, metadata can be used for batch processing of files. A date element is critical to records management, as most record retention schedules are keyed to a record's date of creation. Metadata in more sophisticated data formats, such as eXtensible Markup Language (XML), allows for extraction, use, and calculation based on specific components of a metadata record.
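A minimal sketch of that idea follows, assuming a made-up XML record layout. The element names `created` and `retention_years`, and the rule that disposal falls a whole number of years after creation, are assumptions for illustration and are not taken from any actual retention schedule.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical metadata record; element names are assumptions.
RECORD_XML = """
<record>
  <title>Council minutes</title>
  <created>2010-03-15</created>
  <retention_years>7</retention_years>
</record>
"""


def disposal_date(xml_text):
    """Compute the date on which the retention period ends."""
    root = ET.fromstring(xml_text)
    created = date.fromisoformat(root.findtext("created"))
    years = int(root.findtext("retention_years"))
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # A record created on 29 February may land in a non-leap year.
        return created.replace(year=created.year + years, day=28)


def due_for_review(xml_records, today):
    """Batch step: return titles of records whose retention period has ended."""
    return [
        ET.fromstring(text).findtext("title")
        for text in xml_records
        if disposal_date(text) <= today
    ]
```

Here the date element does the work the paragraph describes: once it is a distinct, machine-readable component of the record, a batch job can calculate with it instead of a person reading each file.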

By Ralph Kimball

Metadata is an amazing topic in the data warehouse world. Considering that we don’t know exactly what it is, or where it is, we spend more time talking about it, worrying about it, and feeling guilty we aren’t doing anything about it than any other topic. Several years ago we decided that metadata is any data about data. This wasn’t very helpful because it didn’t paint a clear picture in our minds. This fuzzy view gradually cleared up, and recently we have been talking more confidently about the "back-room metadata" that guides the extraction, cleaning, and loading processes, as well as the "front-room metadata" that makes our query tools and report writers function smoothly.
The back-room metadata presumably helps the DBA bring the data into the warehouse and is probably also of interest to business users when they ask from where the data came. The front-room metadata is mostly for the benefit of the end user, and its definition has been expanded not only to include the oil that makes our tools function smoothly, but also a kind of dictionary of business content represented by all the data elements.
Even these definitions, as helpful as they are, fail to give the data warehouse manager much of a feeling for what it is he or she is supposed to do. It sounds like whatever this metadata stuff is, it’s important, and we better:

Make a nice annotated list of all of it.
Decide just how important each part is.
Take responsibility for it.
Decide what constitutes a consistent and working set of it.
Decide whether to make it or buy it.
Store it somewhere for backup and recovery.
Make it available to the people who need it.
Assure its quality and make it complete and up to date.
Control it from one place.
Document all of these responsibilities well enough to hand this job off (soon).

Now there is a good, solid IT set of responsibilities. So far, so good. The only trouble is, we haven’t really said what it is yet. We do notice that the last item in the above list really isn’t metadata, but rather, data about metadata. With a sinking feeling, we realize we probably need meta meta data data.
To get this under control, let’s try to make a complete list of all possible types of metadata. We surely won’t succeed in this first try, but we will learn a lot. First, let’s go to the source systems, which could be mainframes, separate nonmainframe servers, users’ desktops, third-party data providers, or even online sources. We will assume that all we do here is read the source data and extract it to a data staging area that could be on the mainframe or could be on a downstream machine. Taking a big swig of coffee, we start the list:

Repository specifications
Source schemas
Copy-book specifications
Proprietary or third-party source specifications
Print spool file source specifications
Old format specifications for archived mainframe data
Relational, spreadsheet, and Lotus Notes source specifications
Presentation graphics source specifications (for example, PowerPoint)
URL source specifications
Ownership descriptions of each source
Business descriptions of each source
Update frequencies of original sources
Legal limitations on the use of each source
Mainframe or source system job schedules
Access methods, access rights, privileges, and passwords for source access
The Cobol/JCL, C, or Basic to implement extraction
The automated extract tool settings, if we use such a tool
Results of specific extract jobs including exact times, content, and completeness.
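The last item on this list, the results of specific extract jobs, is perhaps the easiest to picture as a record. The sketch below is one possible shape for it, not a format from any extract tool; the class name `ExtractJobResult` and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ExtractJobResult:
    """One row of back-room metadata describing a single extract run."""
    source: str
    started: datetime
    finished: datetime
    rows_expected: int
    rows_extracted: int

    @property
    def duration_seconds(self):
        # Exact times: how long the extract ran.
        return (self.finished - self.started).total_seconds()

    @property
    def complete(self):
        # Completeness: did we get every row the source reported?
        return self.rows_extracted == self.rows_expected
```

A nightly job could append one such record per source, which is enough to answer "when did this extract run, and did it finish?" without rereading the extract itself.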

Now let’s list all the metadata needed to get the data into a data staging area and prepare it for loading into one or more data marts. We may do this on the mainframe with hand-coded Cobol, or by using an automated extract tool. Or we may bring the flat file extracts more or less untouched into a separate data staging area on a different machine. In any case, we have to be concerned about metadata describing:

Data transmission scheduling and results of specific transmissions
File usage in the data staging area including duration, volatility, and ownership
Definitions of conformed dimensions and conformed facts
Job specifications for joining sources, stripping out fields, and looking up attributes
Slowly changing dimension policies for each incoming descriptive attribute (for example, overwrite, create new record, or create new field)
Current surrogate key assignments for each production key, including a fast lookup table to perform this mapping in memory
Yesterday’s copy of a production dimension to use as the basis for Diff Compare
Data cleaning specifications
Data enhancement and mapping transformations (for example, expanding abbreviations and providing more detail)
Transformations required for data mining (for example, interpreting nulls and scaling numerics)
Target schema designs, source to target data flows, target data ownership, and DBMS load scripts
Aggregate definitions
Aggregate usage statistics, base table usage statistics, potential aggregates
Aggregate modification logs
Data lineage and audit records (where exactly did this record come from and when)
Data transform run-time logs, success summaries, and time stamps
Data transform software version numbers
Business descriptions of extract processing
Security settings for extract files, software, and metadata
Security settings for data transmission (that is, passwords, certificates, and so on)
Data staging area archive logs and recovery procedures
Data staging-archive security settings.
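Two items from the list above, the in-memory surrogate key lookup and the slowly changing dimension policy, can be sketched together. This is a simplified illustration that assumes the "create new record" policy for every attribute; the class name `DimensionKeyMap` and its methods are hypothetical, and a real staging area would persist this state rather than hold it only in memory.

```python
from itertools import count


class DimensionKeyMap:
    """In-memory lookup from production key to current surrogate key."""

    def __init__(self):
        self._next_key = count(1)
        self._current = {}  # production key -> (surrogate key, attributes)
        self.rows = []      # every dimension row issued so far

    def load(self, production_key, attributes):
        """Return the surrogate key for an incoming row.

        Policy assumed here: if any attribute changed, create a new
        dimension record with a new surrogate key instead of overwriting.
        """
        existing = self._current.get(production_key)
        if existing is not None and existing[1] == attributes:
            return existing[0]
        surrogate = next(self._next_key)
        self._current[production_key] = (surrogate, dict(attributes))
        self.rows.append((surrogate, production_key, dict(attributes)))
        return surrogate
```

Loading the same production key twice with unchanged attributes returns the same surrogate key; loading it with a changed attribute issues a new key and keeps the old row, so history is preserved. The "overwrite" and "create new field" policies would need different logic in `load`.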

Once we have finally transferred the data to the data mart DBMS, then we must have metadata, including:

DBMS system tables
Partition settings
Indexes
Disk striping specifications
Processing hints
DBMS-level security privileges and grants
View definitions
Stored procedures and SQL administrative scripts
DBMS backup status, procedures, and security.

In the front room, we have metadata extending to the horizon, including:

Precanned query and report definitions
Join specification tool settings
Pretty print tool specifications (for relabeling fields in readable ways)
End-user documentation and training aids, both vendor supplied and IT supplied
Network security user privilege profiles, authentication certificates, and usage statistics, including logon attempts, access attempts, and user ID by location reports
Individual user profiles, with links to human resources to track promotions, transfers, and resignations that affect access rights
Links to contractor and partner tracking where access rights are affected
Usage and access maps for data elements, tables, views, and reports
Resource charge back statistics
Favorite Web sites (as a paradigm for all data warehouse access).

Now we can see why we didn’t know what this metadata was all about. It is everything! Except for the data itself. Suddenly, the data seems like the simplest part.
With this perspective, do we really need to keep track of all this? We do, in my opinion. This list of metadata is the essential framework of your data warehouse. Just listing it as we have done seems quite helpful. It’s a long list, but we can go down through it, find each kind of metadata, and identify what it is used for and where it is stored.
There are some sobering realizations, however. Much of this metadata needs to reside on the machines close to where the work occurs. Programs, settings, and specifications that drive processes have to be in certain destination locations and in very specific formats. That isn’t likely to change soon.
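Going down the list and noting what each kind of metadata is used for and where it lives can start as something very small. The sketch below is one hypothetical way to hold such an inventory; the `used_for` and `stored_at` values are invented examples, not statements about any particular warehouse.

```python
from collections import defaultdict
from typing import NamedTuple


class MetadataItem(NamedTuple):
    kind: str
    used_for: str
    stored_at: str


# Illustrative entries only; a real inventory would cover the full list.
INVENTORY = [
    MetadataItem("Source schemas", "extract design", "source system"),
    MetadataItem("Data cleaning specifications", "staging transforms", "staging area"),
    MetadataItem("Aggregate definitions", "load and query speed", "staging area"),
    MetadataItem("View definitions", "query access", "data mart DBMS"),
]


def by_location(items):
    """Group metadata kinds by the machine or system where they reside."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.stored_at].append(item.kind)
    return dict(grouped)
```

Grouping by location makes the sobering point visible: the inventory can be kept in one place even though the metadata it describes stays scattered across the machines where the work occurs.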

Once we have taken the first step of getting our metadata corralled and under control, can we hope for tools that will pull all the metadata together in one place and be able to read and write it as well? With such a tool, not only would we have a uniform user interface for all this disparate metadata, but on a consistent basis we would be able to snapshot all the metadata at once, back it up, secure it, and restore it if we ever lost it.

Don’t hold your breath. As you can appreciate, this is a very hard problem, and encompassing all forms of metadata will require a kind of systems integration that we don’t have today. I believe the Metadata Coalition (a group of vendors trying seriously to solve the metadata problem) will make some reasonable progress in defining common syntax and semantics for metadata, but it has been two years and counting since they started this effort. Unfortunately, Oracle, the biggest DBMS player, has chosen to sit out this effort and has promised to release its own proprietary metadata standard. Other vendors are making serious efforts to extend their product suites to encompass many of the activities listed in this article and simultaneously to publish their own framework for metadata. These vendors include Microsoft, which is working with the Metadata Coalition to extend the Microsoft Repository, as well as a pack of aggressive, smaller players proposing comprehensive metadata frameworks, including Sagent, Informatica, VMark, and D2K. In any case, these vendors will have to offer significant business advantages in order to compel other vendors to write to their specifications.

Data Warehouse & Business Intelligence

Thursday, 12 September 2013

Metadata
