Introduction
In this electronic age where digital information is being created at a
fantastic rate, tools are necessary to locate, access, manage, and understand
it all—and that's where metadata comes in. A common definition of the term is
"data about data." Metadata can serve many functions in data
administration, including detailing the data's history, conditions for use,
format, and management requirements. The Minnesota State Archives' interest in
metadata stems from its mandate to identify, collect, preserve, and make
available the historically valuable records of government regardless of format.
Summary
Metadata is usually defined as "data about data": information (e.g., creator
name, creation date) that is used to facilitate intellectual control of, and
structured access to, other information. Metadata allows users to locate and
evaluate data without each person having to discover it anew with every use. Its
basic elements are a structured format and a controlled vocabulary, which
together allow for a precise and comprehensible description of content,
location, and value.
While the term itself might sound new and trendy, the concept it describes is
not. In some fashion, metadata has always been with us, apparent in everything
from program listings in TV Guide to the nutritional information on the back of
a cereal box. For government records, the familiar forms of metadata are the
recordkeeping metadata standard and the records retention schedule.
According to the State of Minnesota, a record is an item that documents an
official government transaction or action. The statutes define the term as
follows:
"All cards, correspondence, disks, maps, memoranda, microfilm, papers,
photographs, recordings, reports, tapes, writings and other data, information or
documentary material, regardless of physical form or characteristics, storage
media or condition of use, made or received by an officer or agency of the state
and an officer or agency of a county, city, town, school district, municipal
subdivision or corporation or other public authority or political entity within
the state pursuant to state law or in connection with the transaction of public
business by an officer or agency…. The term 'records' excludes data and
information that does not become part of an official transaction, library and
museum material made or acquired and kept solely for reference or exhibit
purpose, extra copies of documents kept only for convenience of reference and
stock of publications and processed documents, and bonds, coupons, or other
obligations or evidence of indebtedness, the destruction or other disposition of
which is governed by other laws" (Minnesota Statutes, section 138.17, subd. 1).
"Information that is inscribed on a tangible medium or that is stored in an
electronic or other medium and is retrievable in perceivable form" (Minnesota
Statutes, section 325L.02).
Anyone who has suffered the exercise in irrelevance offered by an Internet
search engine will appreciate the value of precise metadata. (The Internet is
the vast network of computer systems that enables worldwide connectivity among
users and computers.) Because information in a digital format is only legible
through the use of intermediary hardware and software, the role of metadata in
information technology is fundamentally important. (Minnesota Statutes, section
325L.02, defines information as "data, text, images, sounds, codes, computer
programs, software, databases, or the like.") In any system,
given the volume of information it contains, the uses to which it can be put,
and the costs involved, metadata is the basic tool for efficiency and
effectiveness.
Whatever you want to do with the
information (e.g., protect its confidentiality, present it as evidence, provide
citizens access to it, broadcast it, share it, preserve it, destroy it) will be
feasible only if you and your partners can understand and rely upon the
metadata describing it. Using metadata effectively means understanding and
applying the standards appropriate to your needs.
Metadata Functions
Government agencies routinely use metadata to fulfill a variety of functions,
but the primary uses are for:
- Legal and statutory reasons (e.g., to satisfy records management laws and the rules of evidence)
- Technological reasons (e.g., to design and document systems)
- Operational or administrative reasons (e.g., to document decisions and establish accountability)
- Service to citizens, agency staff, and others (e.g., to locate and share information)
In
all of these cases, metadata standards will be effective only if they rely on a
structured format and controlled vocabulary. "Structured format"
means the metadata is defined in terms of specific, standardized elements or
fields. For example, a library catalog entry for a book will identify its
author, title, subject(s), and location, among other things. Unless all the
elements are there, users will not be able to evaluate the metadata; they won't
be able to answer the question "Is this the book I want?"
"Controlled
vocabulary" means that there is a standard as well for the content of the
elements. For example, the nutritional information on the back of a box of
cereal is often defined in terms of weight per serving. We know what “sugar:
three grams” means. It refers to a standard unit of measurement that allows us
to compare the sugar content of one cereal to that of another. But if the box
read "just the way you like it" or "pretty sweet," that
would mean different things to different people. We couldn't compare a
subjective review like that to what's on the back of another box of cereal.
To
work effectively, the elements and components of metadata should have an
accepted, precise meaning that reflects a common understanding among its
creators and its users. That allows for evaluation and comparison, for
selecting the information you want from all the information available.
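The two requirements can be sketched in a few lines of Python. This is only an illustration: the element names follow the library catalog example above, while the list of approved subject terms and the function name `validate_record` are assumptions made up for the sketch, not part of any actual standard.

```python
# Hypothetical element names and vocabulary, chosen only for illustration.
REQUIRED_ELEMENTS = ("author", "title", "subject", "location")
CONTROLLED_VOCABULARY = {
    "subject": {"agriculture", "education", "transportation"},
}


def validate_record(record):
    """Return a list of problems found in a metadata record (empty if none)."""
    problems = []
    # Structured format: every required element must be present and non-empty.
    for element in REQUIRED_ELEMENTS:
        if not record.get(element):
            problems.append(f"missing element: {element}")
    # Controlled vocabulary: restricted elements must use an approved term.
    for element, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(element)
        if value and value not in allowed:
            problems.append(f"uncontrolled term in {element}: {value!r}")
    return problems
```

A record that passes both checks can be compared with any other record that passes them, which is the point of the cereal box example: a subject of "pretty sweet" would be rejected because it is not an agreed term.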
Metadata and Information Technology
Metadata is useful for the management of information in any storage format,
paper or digital. But it is critically important for information in a digital
format, because such information is legible only through intermediary hardware
and software. We can open up a book or even hold microfilm up to a light to
determine what it says. But we can't just look at a CD and say what's on it. We
cannot possibly hope to locate, evaluate, or use all the files on a single PC,
let alone the Internet, without metadata.
If
information technology makes metadata necessary, it's information technology
that makes metadata useful. Special software applications, such as TagGen, make
the creation of standardized metadata simpler. Databases store and provide
access to metadata. Most software applications automatically create metadata
and associate it with files. One example is the header and routing information
that accompany an e-mail message. Another is the set of properties created with
every Microsoft Word document; certain elements such as the title, author, file
size, etc., are automatically created, but other elements can be customized and
created manually. Normally, some combination of automatically and manually
created information is best for precise and practical metadata.
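As a rough illustration of combining the two kinds of metadata, the Python sketch below reads the header fields that mail software writes automatically and then adds elements supplied by a person. The sample message, the addresses, and the manual element `series` are invented for the example; only the standard library `email` module is used.

```python
from email import message_from_string

# A made-up message; the header lines are what mail software adds on its own.
RAW_MESSAGE = (
    "From: clerk@example.org\n"
    "To: archivist@example.org\n"
    "Subject: Retention schedule update\n"
    "Date: Mon, 03 Mar 2003 09:15:00 -0600\n"
    "\n"
    "The revised schedule is attached.\n"
)


def header_metadata(raw, wanted=("From", "To", "Subject", "Date")):
    """Collect selected header fields from a raw e-mail message."""
    message = message_from_string(raw)
    # A header that is absent from the message comes back as None.
    return {name: message[name] for name in wanted}


def describe(raw, manual_elements):
    """Merge automatically created headers with manually supplied elements."""
    record = header_metadata(raw)
    record.update(manual_elements)
    return record
```

Calling `describe(RAW_MESSAGE, {"series": "correspondence"})` would return the four header values plus the hand-entered series, which is the kind of mixed record the paragraph above recommends.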
Most
important, metadata can inform business rules and software code that transform
it into "executable knowledge." For example, metadata can be used for
batch processing of files. A date element is critical to records management, as
most record retention schedules are keyed to a record's date of creation.
Metadata in more sophisticated data formats, such as eXtensible Markup Language
(XML), allows for extraction, use, and calculation based on specific components
of a metadata record.
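A minimal sketch of such a calculation, in Python, might look like the following. The XML layout, the element names (`created`, `series`, `title`), and the retention periods are all assumptions invented for the example; an actual retention schedule would supply its own values.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical metadata records; the element names are illustrative only.
METADATA_XML = """
<records>
  <record><title>Budget memo</title><created>1995-06-30</created><series>memos</series></record>
  <record><title>Board minutes</title><created>2001-01-15</created><series>minutes</series></record>
</records>
"""

# Hypothetical retention periods, in years, keyed by record series.
RETENTION_YEARS = {"memos": 3, "minutes": 10}


def add_years(start, years):
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:  # 29 February in a non-leap target year
        return start.replace(year=start.year + years, day=28)


def eligible_for_disposition(xml_text, today):
    """Return titles of records whose retention period has ended by `today`."""
    eligible = []
    for record in ET.fromstring(xml_text).iter("record"):
        # Retention is counted from the record's date of creation.
        created = date.fromisoformat(record.findtext("created"))
        years = RETENTION_YEARS[record.findtext("series")]
        if add_years(created, years) <= today:
            eligible.append(record.findtext("title"))
    return eligible
```

Because the date and series are separate, named components of each record, a batch job can select every file that is due for review without anyone opening the files themselves.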
By Ralph Kimball
Metadata
is an amazing topic in the data warehouse world. Considering that we don’t know
exactly what it is, or where it is, we spend more time talking about it,
worrying about it, and feeling guilty we aren’t doing anything about it than
any other topic. Several years ago we decided that metadata is any data about
data. This wasn’t very helpful because it didn’t paint a clear picture in our
minds. This fuzzy view gradually cleared up, and recently we have been talking
more confidently about the "back-room metadata" that guides the
extraction, cleaning, and loading processes, as well as the "front-room
metadata" that makes our query tools and report writers function smoothly.
The
back-room metadata presumably helps the DBA bring the data into the warehouse
and is probably also of interest to business users when they ask from where the
data came. The front-room metadata is mostly for the benefit of the end user,
and its definition has been expanded not only to include the oil that makes our
tools function smoothly, but also a kind of dictionary of business content
represented by all the data elements.
Even
these definitions, as helpful as they are, fail to give the data warehouse
manager much of a feeling for what it is he or she is supposed to do. It sounds
like whatever this metadata stuff is, it’s important, and we better:
- Make a nice annotated list of all of it.
- Decide just how important each part is.
- Take responsibility for it.
- Decide what constitutes a consistent and working set of it.
- Decide whether to make it or buy it.
- Store it somewhere for backup and recovery.
- Make it available to the people who need it.
- Assure its quality and make it complete and up to date.
- Control it from one place.
- Document all of these responsibilities well enough to hand this job off (soon).
Now
there is a good, solid IT set of responsibilities. So far, so good. The only
trouble is, we haven’t really said what it is yet. We do notice that the last
item in the above list really isn’t metadata, but rather, data about metadata.
With a sinking feeling, we realize we probably need meta meta data data.
To
get this under control, let’s try to make a complete list of all possible types
of metadata. We surely won’t succeed in this first try, but we will learn a
lot. First, let’s go to the source systems, which could be mainframes, separate
nonmainframe servers, users’ desktops, third-party data providers, or even
online sources. We will assume that all we do here is read the source data and
extract it to a data staging area that could be on the mainframe or could be on
a downstream machine. Taking a big swig of coffee, we start the list:
- Repository specifications
- Source schemas
- Copy-book specifications
- Proprietary or third-party source specifications
- Print spool file source specifications
- Old format specifications for archived mainframe data
- Relational, spreadsheet, and Lotus Notes source specifications
- Presentation graphics source specifications (for example, PowerPoint)
- URL source specifications
- Ownership descriptions of each source
- Business descriptions of each source
- Update frequencies of original sources
- Legal limitations on the use of each source
- Mainframe or source system job schedules
- Access methods, access rights, privileges, and passwords for source access
- The Cobol/JCL, C, or Basic to implement extraction
- The automated extract tool settings, if we use such a tool
- Results of specific extract jobs including exact times, content, and completeness.
Now
let’s list all the metadata needed to get the data into a data staging area and
prepare it for loading into one or more data marts. We may do this on the
mainframe with hand-coded Cobol, or by using an automated extract tool. Or we
may bring the flat file extracts more or less untouched into a separate data
staging area on a different machine. In any case, we have to be concerned about
metadata describing:
- Data transmission scheduling and results of specific transmissions
- File usage in the data staging area including duration, volatility, and ownership
- Definitions of conformed dimensions and conformed facts
- Job specifications for joining sources, stripping out fields, and looking up attributes
- Slowly changing dimension policies for each incoming descriptive attribute (for example, overwrite, create new record, or create new field)
- Current surrogate key assignments for each production key, including a fast lookup table to perform this mapping in memory
- Yesterday’s copy of a production dimension to use as the basis for Diff Compare
- Data cleaning specifications
- Data enhancement and mapping transformations (for example, expanding abbreviations and providing more detail)
- Transformations required for data mining (for example, interpreting nulls and scaling numerics)
- Target schema designs, source to target data flows, target data ownership, and DBMS load scripts
- Aggregate definitions
- Aggregate usage statistics, base table usage statistics, potential aggregates
- Aggregate modification logs
- Data lineage and audit records (where exactly did this record come from and when)
- Data transform run-time logs, success summaries, and time stamps
- Data transform software version numbers
- Business descriptions of extract processing
- Security settings for extract files, software, and metadata
- Security settings for data transmission (that is, passwords, certificates, and so on)
- Data staging area archive logs and recovery procedures
- Data staging-archive security settings.
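Three items in this list (the slowly changing dimension policy, the surrogate key lookup table, and yesterday's copy of the dimension used for Diff Compare) work together, and a small Python sketch may make the relationship concrete. It is a simplified illustration under stated assumptions, not a description of any particular tool: the class and function names are invented, dimension snapshots are plain dictionaries keyed by production key, and only the "overwrite" and "create new record" policies are handled ("create new field" is left out).

```python
def diff_compare(yesterday, today):
    """Compare two snapshots of a dimension keyed by production key.

    Returns (new_keys, changed_keys), each sorted.
    """
    new_keys = sorted(k for k in today if k not in yesterday)
    changed_keys = sorted(
        k for k in today if k in yesterday and today[k] != yesterday[k]
    )
    return new_keys, changed_keys


class SurrogateKeyMap:
    """In-memory lookup from production key to current surrogate key."""

    def __init__(self):
        self._current = {}
        self._next_key = 1

    def assign_new(self, production_key):
        """Issue a fresh surrogate key and make it the current one."""
        key = self._next_key
        self._next_key += 1
        self._current[production_key] = key
        return key

    def lookup(self, production_key):
        """Return the current surrogate key for a production key."""
        return self._current[production_key]


def apply_changes(key_map, yesterday, today, policy="create_new_record"):
    """Update surrogate key assignments according to the chosen policy."""
    new_keys, changed_keys = diff_compare(yesterday, today)
    for production_key in new_keys:
        key_map.assign_new(production_key)
    if policy == "create_new_record":
        # A changed attribute produces a new dimension record and a new key.
        for production_key in changed_keys:
            key_map.assign_new(production_key)
    # Under "overwrite", changed rows keep their existing surrogate key.
```

Each piece of state in the sketch (the policy string, the key map, and the saved snapshot) corresponds to an item of metadata that has to be stored somewhere between loads.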
Once
we have finally transferred the data to the data mart DBMS, then we must have
metadata, including:
- DBMS system tables
- Partition settings
- Indexes
- Disk striping specifications
- Processing hints
- DBMS-level security privileges and grants
- View definitions
- Stored procedures and SQL administrative scripts
- DBMS backup status, procedures, and security.
In the front room, we have metadata extending to the horizon, including:
- Precanned query and report definitions
- Join specification tool settings
- Pretty print tool specifications (for relabeling fields in readable ways)
- End-user documentation and training aids, both vendor supplied and IT supplied
- Network security user privilege profiles, authentication certificates, and usage statistics, including logon attempts, access attempts, and user ID by location reports
- Individual user profiles, with links to human resources to track promotions, transfers, and resignations that affect access rights
- Links to contractor and partner tracking where access rights are affected
- Usage and access maps for data elements, tables, views, and reports
- Resource charge back statistics
- Favorite Web sites (as a paradigm for all data warehouse access).
Now
we can see why we didn’t know what this metadata was all about. It is
everything! Except for the data itself. Suddenly, the data seems like the
simplest part.
With
this perspective, do we really need to keep track of all this? We do, in my
opinion. This list of metadata is the essential framework of your data
warehouse. Just listing it as we have done seems quite helpful. It’s a long
list, but we can go down through it, find each kind of metadata, and identify
what it is used for and where it is stored.
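One modest way to start that walk-through is a simple inventory with one entry per kind of metadata. The Python sketch below is only a suggestion of the shape such an inventory might take: the field names and the location labels are assumptions, and the four entries are a small sample drawn from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataItem:
    kind: str       # what the metadata is
    used_for: str   # the process it supports
    stored_in: str  # hypothetical label for where it lives


INVENTORY = [
    MetadataItem("Source schemas", "extraction", "source system"),
    MetadataItem("Data cleaning specifications", "staging", "staging area"),
    MetadataItem("View definitions", "querying", "data mart DBMS"),
    MetadataItem("Precanned query and report definitions", "reporting", "query tool"),
]


def by_location(items):
    """Group metadata kinds by the place where they are stored."""
    grouped = {}
    for item in items:
        grouped.setdefault(item.stored_in, []).append(item.kind)
    return grouped
```

Grouping by location shows at a glance how many separate places would have to be backed up and secured to protect the whole set.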
There
are some sobering realizations, however. Much of this metadata needs to reside
on the machines close to where the work occurs. Programs, settings, and
specifications that drive processes have to be in certain destination locations
and in very specific formats. That isn’t likely to change soon.
Once
we have taken the first step of getting our metadata corralled and under
control, can we hope for tools that will pull all the metadata together in one
place and be able to read and write it as well? With such a tool, not only
would we have a uniform user interface for all this disparate metadata, but on
a consistent basis we would be able to snapshot all the metadata at once, back
it up, secure it, and restore it if we ever lost it.
Don’t
hold your breath. As you can appreciate, this is a very hard problem, and
encompassing all forms of metadata will require a kind of systems integration
that we don’t have today. I believe the Metadata Coalition (a group of vendors
trying seriously to solve the metadata problem) will make some reasonable
progress in defining common syntax and semantics for metadata, but it has been
two years and counting since they started this effort. Unfortunately, Oracle,
the biggest DBMS player, has chosen to sit out this effort and has promised to
release its own proprietary metadata standard. Other vendors are making serious
efforts to extend their product suites to encompass many of the activities
listed in this article and simultaneously to publish their own framework for
metadata. These vendors include Microsoft, which is working with the Metadata
Coalition to extend the Microsoft Repository, as well as a pack of aggressive,
smaller players proposing comprehensive metadata frameworks, including Sagent,
Informatica, VMark, and D2K. In any case, these vendors will have to offer
significant business advantages in order to compel other vendors to write to
their specifications.