Project Collaboratory (English)
Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material
The R&D project (IST-1999-20882) was funded by the EU within the "Digital Heritage and Cultural Content" activities. It ran from September 2000 until the end of 2003.
Within the project we designed, implemented and evaluated in real-life use a highly innovative Web-based collaboratory for archives, researchers and end-users working with digitized historic material. It is one of the first working collaboratories in the Humanities and offers new ways of document-centered knowledge work to distributed user groups. European film heritage and censorship processes in the 1920s and 1930s were chosen as an example domain for the project. The developed technologies, however, can easily be adapted to other application domains and usage contexts which are similarly information-intensive.
The current collection of rare historic documents was provided by three major film and national archives from Germany, Austria and the Czech Republic. It consists of about 20000 digitized document pages describing film censorship procedures related to historic films and enriched context documentation including press material and digitized photos and film fragments. Members of these institutions - film historians and archivists - worked as pilot users, employing the system for detailed cataloguing of the document collection and for in-depth content indexing and annotation of relevant sub-collections.
At the end of the project we established both an innovative Web-based collaboratory with a comfortable work environment for in-depth knowledge work with the material and a comprehensive, selected digitized collection of rare historic documents on European historic film that was interpreted and annotated by a multinational team of film experts.
Since the end of the project the achieved results have been maintained and made available to the public. A public "readers" version of the system was offered for free download from the project's Website for almost two years until April 2006.
Project Vision
Currently, numerous valuable historic and cultural sources, a major part of our cultural heritage, are imperiled and scattered across various national archives. Full knowledge and usage of this material are severely impeded by access problems due to: (a) difficult-to-use or electronically unavailable sources, i.e. both documents and formal reference systems, and (b) the lack of appropriate content-based search and retrieval aids that would help users find what they really need. Although many informal cooperations between cultural archives exist that constitute specific professional communities, these communities still lack effective and efficient technological support for collaborative knowledge work. Technologically, the World Wide Web can serve both as a standard communication platform for such communities and as a gateway for document-centered digital library applications.
Within the project, one of the first working collaboratories in the Humanities was to be established, providing a complex but comfortable work environment to support collaborating expert users in their various, ambitious research tasks. We designed and implemented a WWW-based collaboratory for archives, researchers and end-users working with digitized historic/cultural material. As an example domain we used historic film documentation, employing digitized multi-format, multimedia documents on several thousand European early 20th century films. The XML document repository consists of a large corpus of historic text documents (especially on film censorship processes of the 1920s and 1930s) and, for about 100 significant films, enriched documentation including photos, posters and digital film fragments. Sources were provided by three major European film archives from Germany, Austria and the Czech Republic, as well as by several collaborating state archives that made special collections available for the project. The developed tools and interfaces, however, were designed to be generic, i.e. to be easily adaptable to other content domains, types of applications and user types.
Developing a "collaboratory in use", we pursued two complementary overall goals in the project:
- Ensure collaborative accessibility of cultural heritage: Implementation of a content-centric, user-driven information system and working environment on top of a distributed multimedia repository, employing comfortable Web-based tools and interfaces for collaborative work with and content-based access to the digital repository.
- Establish evidence for the acceptability of a collaboratory in the historical domain: Documentation of experiences of the professionals' real-life work with the system, and empirical evaluation of the actual usage of the collaboratory by different user groups, e.g., for "preservation case studies" or other complex scholarly work.
In this approach, technology development and empirical evaluation of the developed system in a real-life environment were closely intertwined. Outputs from both areas of project work strongly influenced each other to allow an iterative, dynamic system development. Evaluation steps were explicitly built in, and the users themselves (administrators, archivists, film scholars and other interested end-users) were actively involved throughout the various development cycles.
All material was analyzed, indexed, annotated and interlinked by domain experts. The system (various versions throughout the project's lifecycle) provided them with appropriate task-based interfaces for in-depth indexing/annotation of documents and other tasks as well as with supporting knowledge management tools (indexing aids like thesauri and special keyword lists).
In this way, a growing body of manually indexed data and metadata emerged within the last two years of the project. The system can exploit this metadata by employing advanced XML-based knowledge management and retrieval methods. The final version of the online collaboratory additionally integrates innovative document processing and management facilities, e.g., XML-based document handling, digital watermarking, and semi-automatic segmentation, categorization and indexing of digitized text documents and pictorial material (photos, posters, film fragments).
By combining results from the manual and automatic indexing procedures, elaborate content-based retrieval mechanisms can be applied. This helps users to find what they are actually looking for, to combine evidence from various sources and to interrelate previously unrelated sources and knowledge. Thus, not only the size and richness but also the quality, affordability and acceptability of the digital repository are constantly being improved.
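The idea of combining manual and automatic indexing for retrieval can be illustrated with a minimal sketch. It is not the project's retrieval engine: the weighting scheme, the function names `build_index` and `search`, and the sample document ids and keywords are all assumptions made up for illustration. The sketch merges two keyword sources into one inverted index and ranks documents by summed evidence, trusting manually assigned keywords more than automatically derived ones.

```python
from collections import defaultdict

# Hypothetical weights (assumption): manual keywords count more than
# automatically derived ones.
MANUAL_WEIGHT = 1.0
AUTO_WEIGHT = 0.5


def build_index(manual, automatic):
    """Merge two {doc_id: [terms]} mappings into {term: {doc_id: weight}}."""
    index = defaultdict(lambda: defaultdict(float))
    for source, weight in ((manual, MANUAL_WEIGHT), (automatic, AUTO_WEIGHT)):
        for doc_id, terms in source.items():
            for term in set(terms):
                index[term.lower()][doc_id] += weight
    return index


def search(index, query_terms):
    """Rank documents by the summed weight of matching query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term.lower(), {}).items():
            scores[doc_id] += weight
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented sample data, for illustration only.
manual = {"doc1": ["censorship", "Berlin"], "doc2": ["poster"]}
automatic = {"doc1": ["censorship"], "doc3": ["censorship", "poster"]}

index = build_index(manual, automatic)
results = search(index, ["censorship", "poster"])
print(results)
```

In this sketch a document indexed both by an expert and by the automatic procedure ranks above a document supported by only one of the two sources.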
The main results at the end of the project include:
- a Web-based collaboratory that provides a comfortable working environment and user interfaces for supporting end-users in their annotation, indexing and retrieval of multi-format, multimedia historic archive material; and
- a comprehensive digital multimedia collection on European historic films and film documentation (about 20000 digitized document pages), annotated and interpreted by a multi-national team of experts using the software system developed within the project.
In sum, the system serves as a virtual knowledge and working environment for distributed user groups which supports individual work and collaboration of domain experts who are analyzing, evaluating, indexing and annotating the material. Particular emphasis is placed on supporting in-depth, interpretative analysis of the digitized sources. The system continuously integrates the hereby derived user knowledge into its digital data and metadata repositories, and on this basis can offer advanced content-based retrieval functionalities within the information system. Users are thus enabled to create and share valuable knowledge about the cultural, political and social contexts, which in turn allows other end-users to better access and interpret the historic material.
Collaboratory
William A. Wulf coined the term "collaboratory" as a merger of the words collaboration and laboratory and defined it as "a center without walls", i.e. a virtual research center on the Web, where professionals and lay persons are provided with means for interacting with colleagues, accessing instrumentation, sharing data and computational resources, and accessing information stored in digital libraries and archives. Whereas various collaboratories have been developed since the early 1990s, mainly in the natural and computer sciences, there have so far been only few comparable efforts in the arts and humanities, aside from some pre-studies and experimental systems with limited functionality.
For the example domain of historic film documentation the system provides a working environment for the annotation, indexing and retrieval of digitized documents, supporting collaborative aspects of this work. Collaboration is here regarded as a social process where individuals or groups contribute unpublished parts of their work on material in the digital collection or just comment on certain issues, e.g., annotating previous annotations, interlinking documents or parts of them, etc. Hence, the original digitized documents and their annotations are complemented with value-added information. In this way results from a discourse between professionals can be incorporated as metadata into the virtual environment allowing for content- and context-based retrieval.
To accomplish a suitable platform for scholarly work in historic/cultural domains, the support of collaborative work must go beyond contemporary groupware products, featuring significant innovative functions such as:
- integration of advanced document processing and groupware functions to allow for collaborative inspection and interpretation of source material, e.g., tagging, annotating and interlinking;
- support of specific tasks and conventions in scholarly work, such as protection of intellectual property where individuals or groups contribute unpublished parts of their work or assets;
- organization of discussions and typical procedures of scholarly work, such as preparing a source edition or assembling and creating material for an exhibition or publication.
Because the system stores rich knowledge about the documents' contents, it allows content-based access and supports complex tasks such as the preparation of a historic museum's exhibition. Using the collaboratory, people from various archives can inspect photos, discuss a selection of them and enlarge this selection by a detailed new search, e.g., for specific motifs of photos combined with certain visual and aesthetic features. Or a journalist who intends to describe the relationship between violence and media can now search historic censorship documents with precise questions. A great deal of the archivists' work, like film reconstructions or text editions, relies on collaborations, which so far have mostly been realized through personal contacts or by stumbling upon information by chance. An online collaboratory allows us to establish this information flow and to access distributed sources with refined search options, and therefore to broaden the archive's scope decidedly.
Film Studies
The historic film domain demonstrates the necessity of an international collaboratory in an impressive way. Films are not confined to national borders. Sometimes films are co-produced by several countries, and they have usually been distributed worldwide since the beginning of film production. The versions of a film shown in different countries can differ. Often there are different language versions, sometimes with abridgements, e.g., due to censorship restrictions. Sometimes scenes that might cause offence to the self-image of one nation are cut, sometimes actors are changed. So film culture, or film as an institution, is deeply involved in a country's cultural politics. Unfortunately, our knowledge of this film complex is limited as a result of the destruction or disappearance of historic film material.
Usually each film production company (or the importers of foreign productions) had to submit their films for approval before distribution and screening of films. The importance of censorship for film history lies mainly in the fact that it is often very difficult to identify the one unique film. Often, there are a lot of different film versions with cuts, changed endings and new inter-titles, depending on the place and date of release. Exactly these differences are documented in censorship documents and allow statements about the original film and the film versions respectively.
Censorship documents can give not only valuable information about the development of mass media and the democratic public sphere but often they provide the only source available today for the reconstruction of the large number of films that have been lost or destroyed. Copies of the film complemented by secondary material like censorship documents, film reviews, press, photos and advertising material are widely distributed in archives.
Today, such material is stored in national archives and provides only an incoherent understanding of film and cultural history. The complex cultural phenomenon film is disintegrated into a scattered puzzle - inaccessible and unknown. Therefore, archive work aims to reconstruct this "unity" and to define an integrated whole of this cultural heritage. This means, for instance, to put together film fragments from various copies in order to obtain a historically correct reconstruction, to use secondary material for intellectual reconstruction of the contents of lost films or cut film scenes, to add information from various sources to a coherent filmography and to compile all textual and visual documents concerning film in order to understand the meaning of this cultural representation. This unity is the prerequisite for the archivists' work and scientific research. A shared knowledge and information space replaces individual, distributed work and puts together all information into one database. From now on the distribution and reception, which had only poorly been interlinked across nations, is being represented in the collaboratory system, which is accessible worldwide via the Internet.
System Technologies
The project supports the ongoing digitization of cultural heritage corpora by establishing innovative models and techniques in the following areas: (1) it employs practicable methods of content and knowledge processing for traditionally isolated document collections; (2) it proposes a new concept of content-based organization, handling and presentation of impaired and precarious historical material; and (3) it supports the hitherto only informal cooperative community in the arts and humanities by offering a comfortable online working environment to transfer tacit expert knowledge of the professionals into explicit knowledge through in-depth annotations.
Content-based access to cultural heritage contents that are both valuable and useful to these communities and the interested public must employ innovative features. To offer an information system with advanced access functions, we went beyond current practices of merely providing digital reproductions of and simple online access to historic sources. Instead, results from current and previous scholarly work such as evaluating and indexing these sources were incorporated into the information system, e.g., in the form of metadata and annotations, which in turn allows improved content-based data access.
As an interactive knowledge environment, the system enables access to a distributed data repository of historic documents. Its pilot users (members of the three film archives that participated in the project) were directly and indirectly involved in system development because they actively contributed to enriching the document repository through successive annotation and indexing. Annotation as a multifunctional means of in-depth analysis is very often done in a team effort; the system therefore implemented features supporting such collaborations (e.g., annotation of annotations, collaborative evaluation and comparison of documents). As a result of the users' work within the last two project years, a large amount of value-added information could be provided in addition to the digitized documents. This dynamic accumulation of additional data through annotations in turn required the data structures to be scalable and extensible.
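"Annotation of annotations" can be pictured as a simple data structure in which an annotation's target is either a document or another annotation. The following is a minimal sketch under that assumption. The class names `Annotation` and `Repository`, the field names and the sample entries are hypothetical and are not taken from the project's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    ann_id: str
    author: str
    target_id: str  # id of a document OR of another annotation
    text: str


@dataclass
class Repository:
    annotations: Dict[str, Annotation] = field(default_factory=dict)

    def add(self, ann: Annotation) -> None:
        self.annotations[ann.ann_id] = ann

    def replies_to(self, target_id: str) -> List[Annotation]:
        """All annotations that point directly at the given target."""
        return [a for a in self.annotations.values() if a.target_id == target_id]

    def thread(self, target_id: str, depth: int = 0) -> List[str]:
        """Depth-first, indented listing of the discussion below a target."""
        lines: List[str] = []
        for ann in self.replies_to(target_id):
            lines.append("  " * depth + f"{ann.author}: {ann.text}")
            lines.extend(self.thread(ann.ann_id, depth + 1))
        return lines


# Invented sample data, for illustration only.
repo = Repository()
repo.add(Annotation("a1", "archivist_A", "doc42", "Cut in reel 3 ordered."))
repo.add(Annotation("a2", "historian_B", "a1", "Matches the press review."))
repo.add(Annotation("a3", "archivist_C", "doc42", "Stamp is illegible."))

listing = repo.thread("doc42")
print("\n".join(listing))
```

Because every annotation carries its own identifier, new comments can be attached at any level without changing existing records, which is one simple way to keep such a structure extensible as the discourse grows.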
In order to capture these dynamics we chose XML as a de facto standard for the encoding of generic document and metadata representation schemata. Through the use of XML we are able to guarantee the generality of our approach, since these schemata can be enriched and tailored to additional sources and knowledge incorporated into our system without any need for re-modeling the whole system. In addition, XML forms the basis for the integration of knowledge processing tools and retrieval functionality in the system. Therefore, the system is capable of capturing the dynamics of collaboration without neglecting the necessary flexibility of scalable and extensible representation schemata, which can be transferred to other content domains as well.
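The extensibility argument can be made concrete with a small sketch. The element names below (`document`, `title`, `keyword`, `watermarkStatus`) are assumptions chosen for illustration and do not reproduce the project's actual schemata. The sketch shows that a reader which only looks up the elements it knows keeps working when a record is later enriched with an additional metadata element.

```python
import xml.etree.ElementTree as ET

# Hypothetical record (assumption): not the project's real schema.
RECORD = """
<document id="doc42">
  <title>Censorship decision</title>
  <keyword>censorship</keyword>
  <keyword>inter-titles</keyword>
</document>
"""


def read_core(xml_text):
    """Read only the core fields and ignore any elements added later."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "keywords": [k.text for k in root.findall("keyword")],
    }


def extend(xml_text, tag, text):
    """Return the record with one extra metadata element appended."""
    root = ET.fromstring(xml_text)
    ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")


before = read_core(RECORD)
extended = extend(RECORD, "watermarkStatus", "signed")
after = read_core(extended)

# The core view of the record is unchanged by the extension.
print(before == after)
```

A production system would typically also validate records against a schema. That step is omitted here to keep the sketch within the Python standard library.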
As the project focuses on acceptability as well as on accessibility, it was essential to facilitate the complex workflows in its domain of film documentation. For this reason we employed advanced models for task-oriented user interfaces for content-based analysis, annotation and retrieval.
Functional Structure
The collaboratory is a multi-functional software package integrating a large variety of functionalities, which are realized by cooperating software modules. It comprises several databases and different document representation schemata. XML is used as the uniform internal representation language for the documents in the repository and the associated metadata as well as for the implementation of the communication protocol among its system modules.
Document Pre-Processing
Digital Watermarking Engine - Through the use of digital watermarking the project aimed to protect intellectual property rights in a public working environment by means of copyright watermarks, integrity watermarks, and digital signatures for the annotation and edition work of the expert users.
Intelligent Document Processing and Classification module - By using innovative machine learning techniques knowledge about digitized documents can be semi-automatically acquired and organized. This is achieved by automatic segmentation, layout analysis and classification of the scanned material.
Image and Video Analysis module - This module enables the automatic conceptual indexation and classification of photos and small film fragments. The results can be used to compare and associate the multimedia material with manually coded images or links to text documents.
System Layers
The system is structured into several functional layers.
Operational Layer - The Operational Layer can be described as a digital data repository. It comprises a variety of data, ranging from scanned-in text documents to multimedia data and the accumulated annotations related to one or more of these original data.
Domain Metadata Layer - In order to organize the stored data in a way that supports the complex knowledge-intensive tasks users want to perform on the repository contents, suitable tools for metadata management are provided. The knowledge structures, which are represented by specific XML schemata, constitute the Domain Model. They comply with current metadata standards, but we also needed to implement extensions to cope with the especially rich structure of our domain.
Task Layer - The system allows a wide variety of user types to access, work with and evaluate the digitized archive material. It was designed to support complex working tasks in historic film documentation. Therefore a generic task model was developed in order to support work tasks like source edition, identification of lost or cut film scenes, preparation of a virtual exhibition, etc. As the focus is on collaboration, groupware-like functionalities were also included, thus allowing for collaborative inspection and interpretation of source material.
Interface Layer - In order to support the users in accomplishing their tasks, the system provides appropriate interfaces for the work with the digital documents. These interfaces are semi-automatically derived from the underlying task models. Certain specialized interface components for annotation, mark-up, editing, search and retrieval were implemented to facilitate user interaction. The specification of the interface structure also utilizes XML to allow for a generic mapping to concrete instantiations.
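How an interface might be derived from an XML task model can be sketched as follows. This is an illustration only: the task-model format, the component names and the function `derive_interface` are assumptions, not the project's actual specification. Each abstract task step is mapped to a concrete interface component. Steps without a known mapping are reported separately, which reflects the "semi-automatic" character of the derivation.

```python
import xml.etree.ElementTree as ET

# Hypothetical task model (assumption): tag and attribute names are invented.
TASK_MODEL = """
<task name="annotate-document">
  <step action="view" object="page-image"/>
  <step action="markup" object="text-region"/>
  <step action="annotate" object="text-region"/>
  <step action="compare" object="document-pair"/>
</task>
"""

# Hypothetical mapping from abstract actions to concrete components.
COMPONENTS = {
    "view": "ImageViewer",
    "markup": "RegionSelector",
    "annotate": "AnnotationEditor",
    "search": "QueryForm",
}


def derive_interface(task_xml, components):
    """Map each task step to a component; unmapped steps need manual design."""
    root = ET.fromstring(task_xml)
    layout, unresolved = [], []
    for step in root.findall("step"):
        action = step.get("action")
        if action in components:
            layout.append((components[action], step.get("object")))
        else:
            unresolved.append(action)
    return {"task": root.get("name"), "layout": layout, "unresolved": unresolved}


ui = derive_interface(TASK_MODEL, COMPONENTS)
print(ui["layout"])
print(ui["unresolved"])
```

Keeping the mapping table separate from the task model means the same task description could be instantiated with a different set of components, in the spirit of the generic mapping described above.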