Introduction
In this electronic age where digital information is being created at a fantastic rate, tools are necessary to locate, access, manage, and understand it all—and that's where metadata comes in. A common definition of the term is "data about data." Metadata can serve many functions in data administration, including detailing the data's history, conditions for use, format, and management requirements. The Minnesota State Archives' interest in metadata stems from its mandate to identify, collect, preserve, and make available the historically valuable records of government regardless of format.
Summary
Metadata is usually defined as "data about data": information (e.g., creator name, creation date) that is used to facilitate intellectual control of, and structured access to, other information. Metadata allows users to locate and evaluate data without each person having to discover it anew with every use. Its basic elements are a structured format and a controlled vocabulary, which together allow for a precise and comprehensible description of content, location, and value.
While the term itself might sound new and trendy, the concept it describes is not. In some fashion, metadata has always been with us, apparent in everything from program listings in TV Guide to the nutritional information on the back of a cereal box. For government records, the familiar forms of metadata are the recordkeeping metadata standard and the records retention schedule.
According to the State of Minnesota, a record is an item that documents an official government transaction or action. The statutes define government records as:
"All cards, correspondence, disks, maps, memoranda, microfilm, papers, photographs, recordings, reports, tapes, writings and other data, information or documentary material, regardless of physical form or characteristics, storage media or condition of use, made or received by an officer or agency of the state and an officer or agency of a county, city, town, school district, municipal subdivision or corporation or other public authority or political entity within the state pursuant to state law or in connection with the transaction of public business by an officer or agency…. The term 'records' excludes data and information that does not become part of an official transaction, library and museum material made or acquired and kept solely for reference or exhibit purpose, extra copies of documents kept only for convenience of reference and stock of publications and process documents, and bond, coupons, or other obligations or evidence of indebtedness, the destruction or other disposition of which is governed by other laws" (Minnesota Statutes, section 138.17, subd. 1).
A second statutory definition reads:
"Information that is inscribed on a tangible medium or that is stored in an electronic or other medium and is retrievable in perceivable form" (Minnesota Statutes, section 325L.02).
Anyone who has suffered the exercise in irrelevance offered by an Internet search engine will appreciate the value of precise metadata. Because information in a digital format is only legible through the use of intermediary hardware and software, the role of metadata in information technology is fundamentally important. (Minnesota Statutes, section 325L.02, defines information as "data, text, images, sounds, codes, computer programs, software, databases, or the like.") In any system, given the volume of information it contains, the uses to which it can be put, and the costs involved, metadata is the basic tool for efficiency and effectiveness.
Whatever you want to do with the
information (e.g., protect its confidentiality, present it as evidence, provide
citizens access to it, broadcast it, share it, preserve it, destroy it) will be
feasible only if you and your partners can understand and rely upon the
metadata describing it. Using metadata effectively means understanding and
applying the standards appropriate to your needs.
Government agencies routinely use metadata to fulfill a variety of functions, but the primary uses are for:
- Legal and statutory reasons (e.g., to satisfy records management laws and the rules of evidence)
- Technological reasons (e.g., to design and document systems)
- Operational or administrative reasons (e.g., to document decisions and establish accountability)
- Service to citizens, agency staff, and others (e.g., to locate and share information)
"Controlled vocabulary" means that there is a standard as well for the content of the elements. For example, the nutritional information on the back of a box of cereal is often defined in terms of weight per serving. We know what “sugar: three grams” means. It refers to a standard unit of measurement that allows us to compare the sugar content of one cereal to that of another. But if the box read "just the way you like it" or "pretty sweet," that would mean different things to different people. We couldn't compare a subjective review like that to what's on the back of another box of cereal.
To work effectively, the elements and components of metadata should have an accepted, precise meaning that reflects a common understanding among its creators and its users. That allows for evaluation and comparison, for selecting the information you want from all the information available.
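As a rough illustration of how a structured format and a controlled vocabulary work together, the sketch below checks a metadata record against both. The element names and the list of record types are hypothetical, chosen only for this example; a real agency would take them from the standard it has adopted.

```python
# Hypothetical element names and vocabulary, chosen only for illustration.
REQUIRED_ELEMENTS = {"title", "creator", "date_created", "record_type"}
RECORD_TYPES = {"correspondence", "report", "map", "photograph"}


def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Structured format: every required element must be present.
    for name in sorted(REQUIRED_ELEMENTS - set(record)):
        problems.append(f"missing element: {name}")
    # Controlled vocabulary: the value must come from the agreed list.
    value = record.get("record_type")
    if value is not None and value not in RECORD_TYPES:
        problems.append(f"record_type not in vocabulary: {value!r}")
    return problems
```

A value such as "pretty sweet" fails the vocabulary check for the same reason it fails on the cereal box: nobody else has agreed on what it means.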
Metadata and Information Technology
Metadata is useful for the management of information in any storage format, paper or digital. But it is critically important for information in a digital format because such information is legible only through the use of intermediary hardware and software. We can open up a book or even hold microfilm up to a light to determine what it says. But we can't just look at a CD and say what's on it. We cannot possibly hope to locate, evaluate, or use all the files on a single PC, let alone the Internet, without metadata.
If information technology makes metadata necessary, it's information technology that makes metadata useful. Special software applications, such as TagGen, make the creation of standardized metadata simpler. Databases store and provide access to metadata. Most software applications automatically create metadata and associate it with files. One example is the header and routing information that accompany an e-mail message. Another is the set of properties created with every Microsoft Word document; certain elements such as the title, author, file size, etc., are automatically created, but other elements can be customized and created manually. Normally, some combination of automatically and manually created information is best for precise and practical metadata.
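The e-mail example can be made concrete with a short sketch that reads the header metadata a mail system attaches to a message. It uses Python's standard `email` package; the message itself and its addresses are made up for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message, used only to show what the software attaches automatically.
RAW = (
    "From: clerk@example.org\n"
    "To: archivist@example.org\n"
    "Date: Mon, 05 Mar 2001 09:30:00 -0600\n"
    "Subject: Retention schedule update\n"
    "\n"
    "Body text goes here.\n"
)

message = message_from_string(RAW)

# Collect the header elements as a simple metadata record.
metadata = {name: message[name] for name in ("From", "To", "Date", "Subject")}

# The Date header is structured, so it can be turned into a real date value.
sent = parsedate_to_datetime(message["Date"])
```

None of these elements was typed in as "metadata" by the sender; the software created them, which is why they are consistent enough to search and sort on.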
Most important, metadata can inform business rules and software code that transform it into "executable knowledge." For example, metadata can be used for batch processing of files. A date element is critical to records management, as most record retention schedules are keyed to a record's date of creation. Metadata in more sophisticated data formats, such as eXtensible Markup Language (XML), allows for extraction, use, and calculation based on specific components of a metadata record.
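A minimal sketch of that idea follows: it pulls the creation date out of each XML metadata record and applies a retention period to decide which records are eligible for disposal. The XML layout, the element names, and the retention periods are all assumptions made for the example, not part of any actual retention schedule.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical record layout and retention periods, for illustration only.
RECORDS_XML = """
<records>
  <record><title>Budget memo</title><type>memo</type><created>1995-06-30</created></record>
  <record><title>County map</title><type>map</type><created>2000-01-15</created></record>
</records>
"""
RETENTION_YEARS = {"memo": 3, "map": 10}


def eligible_for_disposal(xml_text: str, today: date) -> list[str]:
    """Return titles of records whose retention period has ended as of `today`."""
    titles = []
    for record in ET.fromstring(xml_text).iter("record"):
        created = date.fromisoformat(record.findtext("created"))
        years = RETENTION_YEARS[record.findtext("type")]
        # Retention is keyed to the creation date. A record created on
        # February 29 would need special handling that this sketch omits.
        expires = created.replace(year=created.year + years)
        if expires <= today:
            titles.append(record.findtext("title"))
    return titles
```

Because the date and type are separate, labeled components of each record, the same function can be run as a batch over thousands of records without anyone opening a single file.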
By Ralph Kimball
The back-room metadata presumably helps the DBA bring the data into the warehouse and is probably also of interest to business users when they ask from where the data came. The front-room metadata is mostly for the benefit of the end user, and its definition has been expanded not only to include the oil that makes our tools function smoothly, but also a kind of dictionary of business content represented by all the data elements.
Even these definitions, as helpful as they are, fail to give the data warehouse manager much of a feeling for what it is he or she is supposed to do. It sounds like whatever this metadata stuff is, it’s important, and we better:
- Make a nice annotated list of all of it.
- Decide just how important each part is.
- Take responsibility for it.
- Decide what constitutes a consistent and working set of it.
- Decide whether to make it or buy it.
- Store it somewhere for backup and recovery.
- Make it available to the people who need it.
- Assure its quality and make it complete and up to date.
- Control it from one place.
- Document all of these responsibilities well enough to hand this job off (soon).
To get this under control, let’s try to make a complete list of all possible types of metadata. We surely won’t succeed in this first try, but we will learn a lot. First, let’s go to the source systems, which could be mainframes, separate nonmainframe servers, users’ desktops, third-party data providers, or even online sources. We will assume that all we do here is read the source data and extract it to a data staging area that could be on the mainframe or could be on a downstream machine. Taking a big swig of coffee, we start the list:
- Repository specifications
- Source schemas
- Copy-book specifications
- Proprietary or third-party source specifications
- Print spool file source specifications
- Old format specifications for archived mainframe data
- Relational, spreadsheet, and Lotus Notes source specifications
- Presentation graphics source specifications (for example, PowerPoint)
- URL source specifications
- Ownership descriptions of each source
- Business descriptions of each source
- Update frequencies of original sources
- Legal limitations on the use of each source
- Mainframe or source system job schedules
- Access methods, access rights, privileges, and passwords for source access
- The Cobol/JCL, C, or Basic to implement extraction
- The automated extract tool settings, if we use such a tool
- Results of specific extract jobs including exact times, content, and completeness
- Data transmission scheduling and results of specific transmissions
- File usage in the data staging area including duration, volatility, and ownership
- Definitions of conformed dimensions and conformed facts
- Job specifications for joining sources, stripping out fields, and looking up attributes
- Slowly changing dimension policies for each incoming descriptive attribute (for example, overwrite, create new record, or create new field)
- Current surrogate key assignments for each production key, including a fast lookup table to perform this mapping in memory
- Yesterday’s copy of a production dimension to use as the basis for Diff Compare
- Data cleaning specifications
- Data enhancement and mapping transformations (for example, expanding abbreviations and providing more detail)
- Transformations required for data mining (for example, interpreting nulls and scaling numerics)
- Target schema designs, source to target data flows, target data ownership, and DBMS load scripts
- Aggregate definitions
- Aggregate usage statistics, base table usage statistics, potential aggregates
- Aggregate modification logs
- Data lineage and audit records (where exactly did this record come from and when)
- Data transform run-time logs, success summaries, and time stamps
- Data transform software version numbers
- Business descriptions of extract processing
- Security settings for extract files, software, and metadata
- Security settings for data transmission (that is, passwords, certificates, and so on)
- Data staging area archive logs and recovery procedures
- Data staging-archive security settings
- DBMS system tables
- Partition settings
- Indexes
- Disk striping specifications
- Processing hints
- DBMS-level security privileges and grants
- View definitions
- Stored procedures and SQL administrative scripts
- DBMS backup status, procedures, and security
In the front room, we have metadata extending to the horizon, including:
metadata extending to the horizon, including:
- Precanned query and report definitions
- Join specification tool settings
- Pretty print tool specifications (for relabeling fields in readable ways)
- End-user documentation and training aids, both vendor supplied and IT supplied
- Network security user privilege profiles, authentication certificates, and usage statistics, including logon attempts, access attempts, and user ID by location reports
- Individual user profiles, with links to human resources to track promotions, transfers, and resignations that affect access rights
- Links to contractor and partner tracking where access rights are affected
- Usage and access maps for data elements, tables, views, and reports
- Resource charge back statistics
- Favorite Web sites (as a paradigm for all data warehouse access).
With this perspective, do we really need to keep track of all this? We do, in my opinion. This list of metadata is the essential framework of your data warehouse. Just listing it as we have done seems quite helpful. It’s a long list, but we can go down through it, find each kind of metadata, and identify what it is used for and where it is stored.
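One way to start going down the list is to keep a simple inventory that records, for each kind of metadata, what it is used for and where it is stored. The sketch below is only a suggestion of the shape such an inventory might take; the four entries are picked from the list above, and the purposes and storage locations attached to them are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataItem:
    name: str       # kind of metadata, taken from the list above
    used_for: str   # what it is used for (illustrative wording)
    stored_in: str  # where it lives (hypothetical location)


INVENTORY = [
    MetadataItem("Source schemas", "planning extracts", "source system"),
    MetadataItem("Data cleaning specifications", "driving transforms", "staging area"),
    MetadataItem("View definitions", "presenting tables to users", "DBMS"),
    MetadataItem("Indexes", "query performance", "DBMS"),
]


def by_location(items):
    """Group metadata names by the place where each one is stored."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.stored_in].append(item.name)
    return dict(grouped)
```

Grouping by location makes it plain how scattered the metadata is, even for a handful of entries.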
There are some sobering realizations, however. Much of this metadata needs to reside on the machines close to where the work occurs. Programs, settings, and specifications that drive processes have to be in certain destination locations and in very specific formats. That isn’t likely to change soon.
Don’t hold your breath. As you can appreciate, this is a very hard problem, and encompassing all forms of metadata will require a kind of systems integration that we don’t have today. I believe the Metadata Coalition (a group of vendors trying seriously to solve the metadata problem) will make some reasonable progress in defining common syntax and semantics for metadata, but it has been two years and counting since they started this effort. Unfortunately, Oracle, the biggest DBMS player, has chosen to sit out this effort and has promised to release its own proprietary metadata standard. Other vendors are making serious efforts to extend their product suites to encompass many of the activities listed in this article and simultaneously to publish their own framework for metadata. These vendors include Microsoft, who’s working with the Metadata Coalition to extend the Microsoft Repository, as well as a pack of aggressive, smaller players proposing comprehensive metadata frameworks, including Sagent, Informatica, VMark, and D2K. In any case, these vendors will have to offer significant business advantages in order to compel other vendors to write to their specifications.