Information Technology: The 21st Century Revolution
IT R&D Highlights -- Digital Libraries
Overview
DLI Phase Two directions
Human-centered research
Content and collections
Systems and testbeds
International efforts


Overview

The invention of the printing press confronted 15th-century scholars and publishers with challenging problems as they attempted to transform hand-drawn and -lettered documents into multiple printed representations while developing libraries to house and organize them. Even so, by 1501 printers had produced some 20 million copies of 35,000 manuscripts, fueling the expansion of literacy and the spread of human knowledge beyond the social elites. By comparison, today's fast-evolving capabilities of computing systems and networks make it possible to recreate, archive, display, and manipulate exponentially greater quantities of electronically generated documents, data, images, sounds, and video streams--and to offer potentially instant access to this knowledge to a significant portion of the world's population. But organizing "collections" of these many forms of electronic information and developing systems and software tools to make them available to end users require complex technical innovations, including collaboration among experts from widely disparate fields of knowledge.
 
Launched in 1994, the Digital Libraries Initiative (DLI) addresses the conceptual, structural, and computational challenges that must be met before we can realize the vision of universally accessible electronic repositories of human knowledge. Despite the modest scale of the initial DLI program, early interdisciplinary successes of participating researchers--demonstrated in such commercial spinoffs as the Lycos and Google search engines and Go2Net--attracted the attention of growing numbers of scholars and highlighted the enormous potential of digital information resources. DLI Phase Two, begun in FY 1999, spans a larger, more diverse set of research efforts that apply today's increasing computational and bandwidth capacities to the goal of making large-scale, distributed electronic collections accessible, interoperable, and usable through global knowledge networks. DLI Phase Two activities are jointly supported by NSF, DARPA, NIH/NLM, the Library of Congress, NASA, the National Endowment for the Humanities, and the FBI, in partnership with the National Archives and Records Administration, the Smithsonian Institution, and the Institute of Museum and Library Services.



DLI Phase Two directions

DLI Phase Two activities are drawing computer scientists and engineers from academia, industry, and government together with researchers and archivists in the humanities, the arts, and biomedical and physical sciences to develop new digital resource collections and testbed linkages among distributed archives; create frameworks, software, and network architectures that enable fusion of multimedia materials into unified records; resolve semantic problems that currently prevent integration of digital resources from distributed collections; experiment with system designs to ensure the preservation, integrity, and privacy of data; and explore and codify educational applications of digital materials. Phase Two research focuses on three essential dimensions of digital libraries:
  • Human-centered research--the ways in which digital libraries can improve and offer altogether new ways for people to create and use information
  • Content and collections--the kinds of human knowledge digital libraries can house and make available to users
  • Systems-centered research--the engineering, software, and taxonomic issues in creating and linking large-scale and disparate electronic collections via the Internet



Human-centered research


These activities explore next-generation methods, algorithms, and software that can empower expanded educational, professional, and personal uses of high-quality digital information resources. Research focuses on development of intelligent search agents, improved abstracting and summarization techniques, advanced interfaces, and collaboration technologies and tools to enable individuals and groups to search for, retrieve, manipulate, and present electronic information archived in a variety of forms in a distributed network of source collections. Among the issues addressed by DLI Phase Two efforts:

Personalized Retrieval and Summarization of Image, Video, and Language Resources (PERSIVAL)
The explosion in Internet sites devoted to medical and health-related information makes it increasingly difficult for health care providers and consumers to find the most valuable and useful current resources. In the Personalized Retrieval and Summarization of Image, Video, and Language Resources (PERSIVAL) project, Columbia University researchers are experimenting with system designs to provide practitioners with quick and easy access to online medical resources tailored to individual patient needs. The goal is to develop personalized search and presentation tools to sort through distributed medical information, weed out repetitious and non-germane content, and summarize and present current findings that best match the real-time requirements of the practitioner or consumer. Using secure online patient records available at Columbia Presbyterian Medical Center as test models, the research team is linking a multimodal query interface with information extracted from a patient's medical record and user background to create a query graph for an online search of distributed medical resources. Search results are then filtered using natural language processing to provide the best matches with the patient's background. The results are presented in a customized multimedia format. http://www.cs.columbia.edu/diglib/PERSIVAL/

Digital resources designed for children
The ways in which children ages 5 to 10 access, explore, and organize digital learning materials and the issues involved in creating learning environments suited to children's age-specific needs are the focus of a University of Maryland project. University researchers are fashioning developmentally appropriate tools for visualizing, browsing, querying, and organizing information in digital libraries designed for children. Audio, image, video, and text materials for the interdisciplinary research effort--which will include construction of a testbed digital collection about animals--are being made available by the Discovery Channel and the U.S. Department of the Interior's Patuxent (Maryland) Wildlife Research Center. http://www.cs.umd.edu/hcil/kiddiglib/


The University of Maryland has begun a partnership with children ages 5 to 10 and teachers from Yorktown Elementary School to create a multimedia children's digital library. This NSF-supported project will develop visual interfaces that support young children in querying, browsing, and organizing multimedia information, working with the children as "design partners" to develop new technologies that support the learning challenges in their age group.

Technologies and tools for students
Technologies and tools to make online educational resources more accessible and useful to communities of older learners, including college students and adults, are under development in several collaborative research efforts. For example, researchers at the Hypermedia and Visualization Laboratory (HVL) at Georgia State University and the Association for Computing Machinery (ACM) SIGGRAPH Education Committee are developing a model for a reusable national collection of peer-reviewed undergraduate educational applications in XML and improved navigation capabilities using information visualization techniques based on XML and 3-D Web graphics. Related work by researchers at the University of South Carolina in association with collaboratories at the University of Iowa and Georgia State is creating a "Web-lab Library" of simulation software, experiments, and databases designed for students and researchers in the social and economic sciences. http://econ.badm.sc.edu/beam/

Video information collage
Researchers at Carnegie Mellon University are creating an electronic workspace for video materials called a "video information collage," which will enable users to search for, view, and manipulate multiple video, text, image, and sound files from heterogeneous distributed sources. This will allow them to organize their discoveries into "chrono-collages" based on time relationships, "geo-collages" based on spatial relationships, or "auto-documentaries" preserving video's temporal nature. The research also involves creating a public video archive of recordings of historical, political, and scientific importance. http://www.informedia.cs.cmu.edu


Alexandria Digital Earth Prototype (ADEPT)
The Alexandria Digital Earth Prototype (ADEPT) program is a component of a large-scale digital library collaboration of the University of California-Berkeley, the University of California-Santa Barbara (UCSB), Stanford University, SDSC, and the California Digital Library (CDL). The ADEPT project builds on a DLI Phase One project that used UCSB's map and imagery collections to create a large-scale geospatial digital archive, called the Alexandria Digital Library (ADL), featuring maps, aerial photos, gazetteer items, and bibliographic records. In the ADEPT effort, researchers are constructing--and will evaluate the educational effectiveness of--customizable learning environments based on the ADL's geographically referenced contents, enabling students to bookmark and organize information from heterogeneous resources and online services for multidisciplinary academic work. The ADEPT model employs a personalized interface called an Iscape, or Information landscape, with several layers of service and resource materials including meta-information tools indicating which resources in the personalized collection can be used collaboratively. http://alexandria.ucsb.edu/adept/

Power browsers
Stanford researchers are experimenting with "power browsers"--handheld information appliances that access information sources, such as the Web, through wireless connections and software that maximizes the visual and navigation performance of very small displays. The software includes special information crawlers that save time by automatically performing certain search-related tasks. Researchers are also working on a large-scale "WebBase" database technology to store and index millions of Web pages, distributed across computers worldwide, for subsequent searching or analysis. http://www-diglib.stanford.edu/
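
To make the idea of storing and indexing pages for later searching concrete, the following Python sketch builds a small inverted index (a map from each word to the pages that contain it). It is an illustration only: the function names, the tokenization rule, and the in-memory dictionaries are assumptions made for this example and do not describe the WebBase design.

```python
import re
from collections import defaultdict
from typing import Dict, Set

WORD = re.compile(r"[a-z0-9]+")  # assumed tokenization: lowercase letters and digits


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of page URLs whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(url)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the URLs of pages that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Building the index once and answering each query from it avoids rescanning every stored page per search, which is the usual reason for indexing a large collection before it is queried.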



Content and collections

Researchers are creating novel digital archives of sound, image, and video as well as textual records from broad knowledge domains and specific disciplines in the sciences, arts, and humanities. They are evaluating methods of digital representation, preservation, and storage; exploring effective metadata systems (standard structures for presenting the intellectual context and pertinent related information about records in a collection); expanding access to educational materials and courseware; and developing technologies and protocols for addressing related legal and societal issues, such as copyright protection, privacy, and intellectual property management. Current research activities include:

Digital library for the humanities
Tufts University researchers, in partnership with the Max Planck Institute in Berlin, the Modern Language Association (MLA), the Boston Museum of Fine Arts, and the Stoa electronic publishing consortium, are developing the foundations for a scalable, interdisciplinary digital library accessible and useful to scholars as well as everyday Internet users. Materials included will date from ancient Egypt through 19th-century London. This site was processing 5 million requests per month in fall 1999. http://www.perseus.tufts.edu

National Gallery of the Spoken Word (NGSW)
An interdisciplinary team at Michigan State University is building the Nation's first large-scale, fully searchable database and repository of historically significant audio materials spanning the 20th century. The "gallery" will also provide high-quality digital versions of such spoken words as Thomas Edison's first cylinder recordings and the voices of Babe Ruth and Florence Nightingale, with standard bibliographic and metadata access. A key research product will be a set of best practices for future Web sound development, including methods for conversion, preservation, access, and copyright compliance. http://www.ngsw.org/app.html

National digital library for science, mathematics, engineering, and technology education (SMETE)
University of California-Berkeley researchers who developed the National Engineering Education Delivery System (NEEDS) digital library are exploring ways to expand the collection to encompass science, mathematics, and technology. The group is using its Web-based information portal, which supports cataloguing, searching, displaying, and reviewing of digital learning materials and courseware, to begin developing a SMETE digital library, demonstrate the online resource's capabilities, and evaluate the initial SMETE testbed collection. The NSF-supported effort aims to create a broad-based digital learning resource for K-12 and postsecondary education. http://www.needs.org

Digital Atheneum
NSF-funded researchers at the University of Kentucky, in partnership with the British Library and with support from IBM's Shared University Research (SUR) program, are developing state-of-the-art techniques to digitally restore and enhance aging and damaged original documents and create searchable archives of such materials. Working with documents from the British Library's Cottonian Collection (which contains Greek, Hebrew, and Anglo-Saxon manuscripts collected by 17th-century antiquarian Sir Robert Bruce Cotton), they are testing new methods to illuminate otherwise invisible text and markings on documents and create digital annotation systems and semantic frameworks for domain- and data-specific searches of these materials. http://www.digitalatheneum.org


The NSF-supported Digital Atheneum project, based at the University of Kentucky, is developing state-of-the-art technologies to restore severely damaged manuscripts, ultimately presenting an electronic digital library of restored and edited images of previously inaccessible manuscripts. Illustrated here is a damaged manuscript as seen by the human eye (left), and with hidden markings revealed by ultraviolet digitization (right).

Digital workflow management
The more than 29,000 pieces of American popular sheet music in the Johns Hopkins University's Lester S. Levy Collection, already converted into digital records, will be made more accessible and usable through this project to create sound renditions and enhanced search capabilities. From items in the collection, which covers the period from 1790 to 1960, researchers will generate audio files and full-text lyrics using optical music recognition software written by staff of the Peabody Conservatory of Music at Johns Hopkins, and will develop workflow management tools to reduce and focus the human labor involved. The activities will result in a framework, tested process, and set of tools transferable to other large-scale digitization projects. http://levysheetmusic.mse.jhu.edu

A treasure of the Special Collections Department of the Johns Hopkins University's Milton S. Eisenhower Library, the Lester S. Levy Collection of sheet music contains more than 29,000 pieces of American popular music spanning the period from 1790 to 1960, with a particular strength in the 19th century. All pieces are indexed at the library's online site, and visitors to the site can retrieve a catalog description of each piece. Additionally, for music published prior to 1923 and now in the public domain, a cover image and a page of music can be downloaded from the site.


Data provenances

Research at the University of Pennsylvania addresses one of the most difficult aspects of online resource collections: the questions surrounding the origin, or provenance, of an electronic record--such as how old it is, how it was originally generated, who produced it, and who has modified it. These questions are even more challenging in electronic than in traditional archives because the material involved ranges from a single pixel in a digital image to an entire database. Drawing on concepts from emerging software for presenting structured documents on the Web, researchers will develop prototype document "attachments" where annotations regarding provenance can be stored and queried, providing new data models, query languages, and storage techniques. http://db.cis.upenn.edu/Research/provenance.html
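
One way to picture a provenance "attachment" is as a small log of annotations, each tied to the part of the record it describes, whether that is a whole database or a single item inside it. The Python sketch below is illustrative only: the class names, the fields, and the slash-separated path notation are assumptions made for this example, not the project's data model or query language.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ProvenanceNote:
    """One provenance annotation (hypothetical structure)."""
    target: str  # part of the record described, e.g. "db/table1"
    agent: str   # who created or modified the target
    action: str  # for example "created" or "modified"
    date: str    # ISO 8601 date, so string order matches time order


@dataclass
class ProvenanceAttachment:
    """Annotations stored alongside a record and queryable by target."""
    notes: List[ProvenanceNote] = field(default_factory=list)

    def add(self, note: ProvenanceNote) -> None:
        self.notes.append(note)

    def history(self, target: str) -> List[ProvenanceNote]:
        """Notes about the target or any part enclosing it, oldest first."""
        matches = [
            n for n in self.notes
            if target == n.target or target.startswith(n.target + "/")
        ]
        return sorted(matches, key=lambda n: n.date)
```

Under these assumptions, asking for the history of a fine-grained item such as "db/table1/row3" also returns notes recorded for "db/table1" and "db", reflecting the point that provenance may apply at any granularity.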



Systems and testbeds

Systems research focuses on developing component technologies and the integration needed to create information environments that are dynamic and flexible; responsive at the individual, group, and institutional levels; and capable of continually adapting growing and changing bodies of data to new user-defined structures. These capabilities are prototyped and evaluated in testbed demonstrations that focus on media integration, software functionality, and breakthrough applications that offer transforming paradigms for social and work practices on a large scale.

New model for scholarly
publishing
The current print model of academic publishing, based on centralized control and restricted distribution, originated long before the start of the information age. In another component of the large-scale DL collaboration at California institutions, University of California-Berkeley researchers are developing technologies and tools to create a distributed, continuous, self-publishing paradigm to use and disseminate scholarly information in this era of instantaneous global communication. The publishing system prototypes will be tested and demonstrated in the emerging CDL and on a testbed developed by SDSC. http://elib.cs.berkeley.edu

Classification systems
Among the most complex technical challenges of digital archives is how to adapt or re-invent standardized identification and classification schemes for their contents, as well as interoperable search architectures that users need to locate these resources. On top of traditional print catalog taxonomies, archivists of electronic artifacts are juggling a number of new content categories (for example, video, image, sound, and software programs), formats (such as the JPEG and GIF formats for graphic images), and related operational annotations. Researchers at the University of Arizona, in partnership with SGI, NCSA, NIH's NLM and NCI, GeoRef Information Services, and Petroleum Abstracts, are working on an architecture and associated techniques to automatically generate classification systems from large domain-specific textual collections and unify them with manually created classification structures. To generate and test prototypes, they are parallelizing and benchmarking computationally intensive scalable automatic clustering methods for keyword searching on large-scale collections with existing classification systems such as CancerLit (700,000 abstracts) and the NLM's UMLS (500,000 medical concepts) in medicine; GeoRef and Petroleum Abstracts (800,000 abstracts) and GeoRef Thesaurus (26,000 terms) in geoscience; and on Web applications including a collection with 1.5 million Web pages and the Yahoo! classification system (20,000 categories). Using simulations on parallel, high performance platforms, scientists will optimize and evaluate the output of the various algorithms and develop hierarchical display methods to visualize the results. http://ai.bpa.arizona.edu/go/dl/

Virtual workspaces
Even when digital collections are structured and catalogued internally, many users also need something akin to a large, well-lit library table on which they can spread out items from various sources to work with and compare. Harvard-MIT Data Center researchers are jointly designing a Virtual Data Center (VDC) to manage and share quantitative social-science materials for research and teaching among multiple institutions and the public. The VDC will link with other research centers and databases, enabling participants to deposit data in many formats and set terms of access to their materials. Users will be able to download data containing only the variables they specify. The VDC's suite of open software tools ultimately will be offered as a free, portable product. http://www.thedata.org
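
Downloading "only the variables they specify" amounts to selecting columns from a deposited data set. The Python sketch below shows that operation on comma-separated text using the standard csv module. The function name and the choice of CSV as the format are assumptions for illustration; the VDC's own tools and formats are not described here.

```python
import csv
import io
from typing import Iterable


def extract_variables(csv_text: str, wanted: Iterable[str]) -> str:
    """Return CSV text holding only the requested columns, in the requested order."""
    wanted = list(wanted)
    reader = csv.DictReader(io.StringIO(csv_text))
    available = reader.fieldnames or []
    missing = [name for name in wanted if name not in available]
    if missing:
        raise KeyError(f"unknown variables: {missing}")
    out = io.StringIO()
    # extrasaction="ignore" drops every column that was not requested.
    writer = csv.DictWriter(
        out, fieldnames=wanted, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

Rejecting unknown variable names up front, rather than silently returning empty columns, is a design choice made for this sketch so that a mistyped request is noticed before any data is delivered.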

Security, quality, access, and reliability

In addition to effective classification systems and tools for users, the infrastructure of digital libraries, like that of their physical counterparts, requires systems ensuring the physical security of the collection, quality control, and remote access to the contents. Stanford University researchers are exploring ways to guarantee the long-term survival of digital information despite media obsolescence, natural disasters, and institutional change. They are prototyping techniques for automatically monitoring changes in collections and continuously "mirroring" the information into a large-scale archive that is automatically replicated at other sites. The prototype also uses mathematical models of projected failures in storage media to alert human operators to possible malfunctions.
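
The monitoring-and-mirroring idea can be sketched by comparing content digests between a primary collection and each replica and copying only what is new or changed. The Python example below is a minimal illustration under stated assumptions: collections are modeled as in-memory dictionaries, SHA-256 is chosen as the digest, and all function names are hypothetical. It does not model the storage-failure predictions mentioned above and is not the Stanford prototype.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Content fingerprint used to detect changes."""
    return hashlib.sha256(data).hexdigest()


def changed_items(primary: Dict[str, bytes], replica: Dict[str, bytes]) -> List[str]:
    """Names in the primary that are missing from, or differ in, the replica."""
    return sorted(
        name
        for name, data in primary.items()
        if name not in replica or digest(replica[name]) != digest(data)
    )


def mirror(primary: Dict[str, bytes], replicas: List[Dict[str, bytes]]) -> Dict[int, List[str]]:
    """Copy new or changed items to every replica; report what was copied."""
    report: Dict[int, List[str]] = {}
    for i, replica in enumerate(replicas):
        names = changed_items(primary, replica)
        for name in names:
            replica[name] = primary[name]
        report[i] = names
    return report
```

In this sketch nothing is ever deleted from a replica, on the assumption that an archive should retain items even after they disappear from the primary. A real system would also need to guard against copying a corrupted primary over a good replica.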

At Cornell University, researchers are focusing on the integrity of digital library information, devising prototype administrative architectures to ensure that archived information is reliable and readily available, and that the intellectual-property rights of authors and the privacy rights of users are protected. http://www.prism.cornell.edu



International efforts

In only a few years, DL activities have expanded to encompass not only digital construction work on important human records but also international collaborations to facilitate universal access to these new information resources.

U.S.-U.K. activities
An initiative of NSF and Britain's Joint Information Systems Committee, for example, supports international research to solve fundamental technical problems in linking and accessing geographically distributed materials in differing formats. These projects include:

  • A University of California-Berkeley/University of Liverpool Library study to enable cross-domain searching in a multidatabase environment. The aim is to produce Cheshire, a next-generation online information retrieval system based on international standards, for Internet searches across collections of original materials, printed books, records, archives, manuscripts, museum objects, statistical databases, and full-text, geo-spatial, and multimedia data resources.
  • HARMONY, a three-way partnership of Cornell University, the Australian Distributed Systems Technology Centre, and the University of Bristol (UK) to devise a metadata framework for describing networked collections of complex mixed-media digital objects. The research will draw together work on the Resource Description Framework (RDF), XML, Dublin Core, and the Moving Picture Experts Group's MPEG-7, all of which are standards for representing and exchanging structured metadata. The goal is to enable multiple communities of library, education, and rights management experts to define overlapping descriptive vocabularies for annotating multimedia content. http://www.ilrt.bris.ac.uk/discovery/harmony/
  • A demonstration by Cornell University, the Los Alamos National Laboratory, and the University of Southampton (UK) will hyperlink each of the more than 100,000 papers in Los Alamos's online Physics Archive to every other paper in the archive that it cites. The project aims to highlight this powerful tool for navigating the scientific journal literature to encourage authors in other fields to join in creating similar hyperlinked online archives across disciplines and around the world. http://journals.ecs.soton.ac.uk/x3cites/
  • A tool for locating music itself online--a type of search that is not currently possible--is the goal of a collaborative effort by researchers at the University of Massachusetts and King's College London. This online music recognition and searching (OMRAS) tool will enable users to find musical information stored in online databases in formats ranging from encoded score files to digital audio. http://journals.ecs.soton.ac.uk/x3cites/
  • The IMesh Toolkit project, a partnership of the University of Wisconsin-Madison, the University of Bath (UK), and the University of Bristol (UK), uses an emerging approach to accessing Internet resources through highly selective, subject-specific Web sites called "subject gateways," and builds on the IMesh international collaboration among leading subject gateway developers. The research aims to advance the framework within which subject gateways and related services operate by defining an architecture specifying individual components and how they communicate. The architecture will allow interoperability and cross-searching between subject gateways developed in different countries. http://www.imesh.org/toolkit/
  • University of Michigan researchers, working with representatives of Britain's Consortium of University Research Libraries, are investigating the potential role of emulation in long-term preservation of information in digital form. The project will develop and test a suite of emulation tools; evaluate the costs and benefits of emulation as a preservation strategy for complex multimedia documents and objects; devise models for collection management decisions about investments in exact replication in preservation activity; assess options for preserving an object's functionality, "look," and "feel"; and generate preliminary guidelines for use of different preservation strategies such as conversion, migration, and emulation.
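
As an illustration of the structured metadata that the HARMONY item above refers to, the Python sketch below writes a handful of Dublin Core elements as XML using the standard library. The Dublin Core element namespace is the published one; the wrapping "record" element, the function name, and the sample values in the usage lines are assumptions made for this example and are not drawn from the HARMONY framework.

```python
import xml.etree.ElementTree as ET
from typing import Dict

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: Dict[str, str]) -> str:
    """Serialize Dublin Core element names and values as a small XML record."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(record, encoding="unicode")


# Usage with made-up sample values:
xml_text = dublin_core_record(
    {"title": "Sample recording", "format": "audio/wav", "type": "Sound"}
)
print(xml_text)
```

Because each element carries the shared namespace, different communities can add elements from their own vocabularies to the same record without name clashes, which is the kind of overlap among descriptive vocabularies the project aims to support.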

U.S.-Germany activities
In January 2000, NSF/DLI Phase Two and Germany's Deutsche Forschungsgemeinschaft issued a joint call for collaborative proposals from U.S. and German university researchers on developing and organizing internationally accessible digital collections.

NSF-EU working groups
The Joint NSF-European Union (EU) working groups on future directions for digital libraries research have completed their initial studies of national, technical, social, and economic issues and plans for common research agendas. Five working groups--each of which includes U.S. researchers from academia, industry, and government--addressed economic issues and intellectual property rights, interoperability among digital library systems, metadata, multilingual information access, and resource indexing and searching issues in globally distributed digital libraries. The final report, entitled "An International Research Agenda for Digital Libraries," and working papers can be found at: http://www.si.umich.edu/UMDL/EU_Grant/home.htm and http://www.iei.pi.cnr.it/DELOS//NSF/nsf.htm.

4201 Wilson Blvd, Suite II-405, Arlington, VA 22230 | (703) 292-4873 | (703) 292-9097 (fax)