Report: Metadata Seminar - Canberra 6th March 1997
or
How can I find what I am looking for?
Metadata is information that describes other information. Catalogue records for library materials are a common example of metadata. On the web, metadata is usually contained within the head element of a document. The most useful form of metadata is that which assists in discovering exactly the information you want. Usually metadata describes the contents, physical description, location, type and form of the resource, and data necessary for management including migration history, expiry dates, security, authentication, and file formats.
Overall, the seminar in Canberra reflected the archival/librarian community's inability to leave its comfort zone and come to grips with the real issues inherent in electronically networked media. Most argument and discussion is concerned with document-like objects (things which can be printed) instead of fluidly generated (dynamic) objects which are continuously changing.
Three types of metadata schemes are currently used for networked electronic documents:
- automatically generated indexes used by locater services such as WebCrawler, Alta Vista, Open Text, Lycos etc;
- cataloguing records, such as MARC (Machine-Readable Cataloging), created by professional information providers;
- various proprietary architectures, sometimes suited to the material they describe, such as:
PICS
The Platform for Internet Content Selection (PICS), developed by the W3C, is another metadata standard aimed at describing the content of Internet documents, in particular the ratings on sensitive material (such as the level of nudity or violence). PICS does not itself specify any rating service; it provides the syntax and infrastructure for defining such services (using labels). Third parties have already designed rating services (such as RSACi and SafeSurf). However, PICS also facilitates other uses for labels, including code signing, privacy, and intellectual property rights management. The implementation of PICS baffles me.
Is there a simple guide?
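In lieu of a simple guide, here is a rough sketch of how a PICS label might be assembled. It assumes the PICS-1.1 label syntax (a rating service URL, the URL of the labelled resource, then rating name/value pairs) and should be checked against the W3C label specification before use. The function names, the service URL, the resource URL and the category names "n" and "v" are all placeholders invented for illustration; a real rating service defines its own categories.

```python
def build_pics_label(service_url, resource_url, ratings):
    """Assemble a PICS-1.1 style label string for one resource.

    Assumption: the label is a parenthesised list naming the rating
    service, the labelled resource, and the rating name/value pairs.
    """
    rating_text = " ".join(
        f"{category} {value}" for category, value in ratings.items()
    )
    return (
        f'(PICS-1.1 "{service_url}" labels for "{resource_url}" '
        f"ratings ({rating_text}))"
    )


def wrap_in_meta_tag(label):
    """Embed a label in an HTML META tag for the document head."""
    # Single quotes surround the content because the label itself
    # contains double quotes.
    return f"<META http-equiv=\"PICS-Label\" content='{label}'>"


# Hypothetical service, resource and categories, for illustration only.
label = build_pics_label(
    "http://ratings.example.org/v1.html",
    "http://www.example.com/",
    {"n": 0, "v": 0},
)
print(wrap_in_meta_tag(label))
```

The rating service decides what the categories and values mean; PICS only carries them.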
Automatically generated records often contain too little information to be useful, while manually generated records are costly to create and maintain.
Ideally, providing metadata should be an integral part of creating documents. So, W3C is attempting to evolve a standard set of metadata that is simple enough for easy use by creators and maintainers of digital information, yet sufficiently descriptive to help users locate the resources.
The Dublin Core lists 15 elements that describe what are considered the essential features of electronic documents. A sample of these in use can be seen by viewing the source of this document. It looks like this:
    <HEAD>
    <TITLE>Report on Metadata Seminar</TITLE>
    <meta name="Title" content="Report on Metadata Seminar - 6th March">
    <meta name="Description" content="description of metadata and evolving standards for digital information">
    <meta name="Creator" content="Simon Pockley">
    <meta name="Keywords" content="describe, description, digital, electronic, Dublin Core, contextual information, record description, evidence">
    </HEAD>
If you want to see the Dublin Core in action, view the document source of the Resource Discovery Unit site. There you will also find a wealth of valuable information on resource discovery related topics. Its pages include metadata that complies with both the Dublin Core scheme and the PICS scheme.
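Embedded metadata of this kind is easy for software to read back. The following is a minimal sketch, using only the Python standard library, of how a locator service might pull name/content pairs out of a page header like the sample above. The class and function names are invented for this example, and the header is a shortened copy of the sample.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.records = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.records[name] = content


def extract_metadata(html_text):
    """Return a dict of the meta name/content pairs found in html_text."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.records


SAMPLE_HEAD = """<HEAD>
<TITLE>Report on Metadata Seminar</TITLE>
<meta name="Title" content="Report on Metadata Seminar - 6th March">
<meta name="Creator" content="Simon Pockley">
</HEAD>"""

metadata = extract_metadata(SAMPLE_HEAD)
for label, value in metadata.items():
    print(f"{label}: {value}")
```

A page with no meta tags simply yields an empty dictionary, which is the situation automatically generated indexes face on most of the web.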
- Title Label: TITLE
The name given to the resource by the CREATOR or PUBLISHER.
- Author or Creator Label: CREATOR
The person(s) or organization(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.
- Subject and Keywords Label: SUBJECT
The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.
- Description Label: DESCRIPTION
A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.
- Publisher Label: PUBLISHER
The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.
- Other Contributors Label: CONTRIBUTORS
Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and convenors).
- Date Label: DATE
The date the resource was made available in its present form. The recommended best practice is an 8-digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schemes are possible, but if used, they should be identified in an unambiguous manner.
- Resource Type Label: TYPE
The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types. A preliminary set of such types can be found at the following URL:
- Format Label: FORMAT
The data representation of the resource, such as text/html, ASCII, PostScript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.
- Resource Identifier Label: IDENTIFIER
String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names, would also be candidates for this element.
- Source Label: SOURCE
The work, either print or electronic, from which this resource is derived, if applicable. For example, an HTML encoding of a Shakespearean sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.
- Language Label: LANGUAGE
Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages.
- Relation Label: RELATION
Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should be currently considered experimental.
- Coverage Label: COVERAGE
The spatial locations and temporal durations characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should be currently considered experimental.
- Rights Management Label: RIGHTS
The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.
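To make the element list concrete, here is a small sketch of how an author's tool might hold a few of these elements and write them out as meta tags, including the recommended YYYYMMDD form for DATE. The function names are invented for this example, and the mixed-case labels follow the sample header shown earlier rather than any formal rule. The values are taken from the text above.

```python
from datetime import date
from html import escape


def format_dc_date(day):
    """Return a date as an 8-digit YYYYMMDD string."""
    return day.strftime("%Y%m%d")


def render_meta_tags(record):
    """Render label/value pairs as HTML <meta> tags, one per line."""
    lines = []
    for label, value in record.items():
        # escape() protects quotes and angle brackets inside values.
        lines.append(
            f'<meta name="{escape(label)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


record = {
    "Title": "Report on Metadata Seminar",
    "Creator": "Simon Pockley",
    "Date": format_dc_date(date(1996, 12, 3)),
    "Format": "text/html",
}
print(render_meta_tags(record))
```

Elements that are empty or unknown are simply left out of the record, in keeping with the note under RIGHTS that no assumptions should be made about absent fields.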
Current Issues
Overall, the issues are bipolar. On the one hand are the forces of complexity, which make schemes, descriptions and qualifications so complex and arcane that while they might be exhaustive in content, very few people have any idea of what they are about, let alone how to implement them. On the other is the need for simplicity, by which agreed types of metadata can be generated by anybody, but where description is, at best, basic.
The advantage of embedding the metadata within the resource is the tight coupling between the metadata and the resource. Whenever the resource is copied or moved, the metadata goes with it. Whenever the resource is modified, the metadata is right there to be modified as well. When the resource is deleted, the metadata goes away as well.
- What additional metadata or descriptions can be added for specialised purposes?
The Warwick Framework (this paper by L. Dempsey and S. Weibel from D-Lib Magazine, July/August 1996) is a modular-type architecture that will allow for more specialised elements to be incorporated into basic sets such as the Dublin Core.
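As a loose illustration of the modular idea only, and not the Warwick Framework's actual specification, the sketch below models a container that holds separate metadata packages, one per scheme, so that a specialised set can sit beside a basic Dublin Core set without altering it. The class, the scheme names and the "film-archive" package with its RunningTime element are hypothetical, invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataContainer:
    """Holds independent metadata packages, keyed by scheme name."""

    packages: dict = field(default_factory=dict)

    def add_package(self, scheme, elements):
        # Each package is stored separately, so adding a specialised
        # scheme never changes the elements of another scheme.
        self.packages[scheme] = dict(elements)

    def lookup(self, scheme, element):
        """Return one element from one package, or None if absent."""
        return self.packages.get(scheme, {}).get(element)


container = MetadataContainer()
container.add_package(
    "dublin-core",
    {"Title": "Report on Metadata Seminar", "Creator": "Simon Pockley"},
)
# Hypothetical specialised package, for illustration only.
container.add_package("film-archive", {"RunningTime": "12 minutes"})

print(container.lookup("dublin-core", "Creator"))
print(container.lookup("film-archive", "RunningTime"))
```

A tool that understands only the basic set can ignore the packages it does not recognise.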
- Where do you put metadata?
Consensus is yet to be reached on where metadata is best stored. In a report on the National Digital Library Project at the Library of Congress ('Historical Collections for the National Digital Library. Lessons and Challenges at the Library of Congress', Caroline R. Arms, D-Lib Magazine, April 1996), four ways of storing descriptive data are suggested:
- in the items themselves, for those file formats that support descriptive headers;
- in linked items, for example by scanning for each digitised book or document a "target" page on which details are recorded about the digitisation operation, such as the item's logical name, the equipment used, and special instructions for conversion;
- in external catalogues or finding aids, to support the identification of relevant items by searching or browsing; or
- integrated with the digital object in the repository structure, to support retrieval of an identified item.
To significantly ease the discovery problems that currently characterise the Internet, tools that enable authors to better describe their resources must evolve.
The next few years will see work directed towards:
- Integrating different metadata sets together
- Building metadata production and management tools
- Designing ways to extend metadata standards
There is a real opportunity for Cinemedia to be a pioneer in the description of non-document-like objects and in the use of metadata to integrate management roles and procedures.
cobbled together by Simon Pockley