Towards a IIIF-based corpus management platform

Abstract : The digital text platform is part of the Flemish contribution to DARIAH Belgium (DARIAH = Digital Research Infrastructure for the Arts and Humanities). The goal is to create a platform for the collaborative management and discovery of digitised textual collections that allows digital humanities researchers to prepare their corpora (consisting of, for example, digitised newspapers and books) for textual analysis. The platform will enable researchers to browse and search the digitised collections compiled, cleaned, enriched and managed by the researchers themselves. Once the relevant research sub-corpus has been compiled, data export tools, using standardised open formats (such as XML, JSON, .csv, .txt, etc.) will enable researchers to export sub-corpus for analysis with existing digital text analysis tools such as MALLET, (http://mallet.cs.umass.edu/topics.php) for topic modelling, VOYANT (http://voyant-tools.org) for data visualisation or AntConC (http://www.laurenceanthony.net/software/antconc/) for concordance and textual analysis. The platform has been conceived as part of a larger and modular virtual research environment service infrastructure (http://www.ghentcdh.ugent.be/projects/dariah-vl_vre.si). In a previous phase, possible frameworks and content management systems were tested, notably Islandora (a digital asset management system based on Fedora Commons and Drupal), but also Mediawiki and Omeka. One of the main challenges of the envisaged new platform is the possibility to integrate a wider variety of possible textual data streams (including a scan workflow). In addition, user-friendliness, scalability, adherence to standards and facilitating the interoperability of data are key issues to be addressed. The platform will build on the existing IIIF format, the International Image Interoperability Framework. This format is used by some of the most important libraries and cultural heritage institutions in the world, therefore providing access to enormous collections of digital objects. As the name suggests, IIIF is mainly focused on displaying and annotating images. However, we fully endorse the IIIF-community’s vision to develop an overarching interoperability framework for other data types, including all kinds of textual data. Benefits of the format include the interoperability, the ease of sharing images and annotations without the need to exchange files, and its support for multilingual data. In the months leading up to the conference, we will evaluate the existing IIIFpowered digital libraries and research projects and how they deal with practices of co-creation, data cleaning and enrichment of (structural) metadata. OCR improvement will become vital, as digital textual analysis can only be performed well on high-quality textual data. A related challenge will be combining the various input formats and converting them to different output formats required for analysis. In our poster, we will present a summary of our experiences with and technical assessment of our previous Islandora installation, in addition to our survey of the existing corpus management solutions. As a way of conclusion, we will introduce the envisioned new version of the platform.
Type de document :
Poster
Digital Humanities Benelux 2017, Jul 2017, Utrecht, Netherlands. 〈https://dhbenelux2017.eu/〉
Liste complète des métadonnées

https://hal.archives-ouvertes.fr/hal-01571618
Contributeur : Joke Daems <>
Soumis le : jeudi 3 août 2017 - 10:24:54
Dernière modification le : samedi 5 août 2017 - 01:04:39

Fichier

CMP_poster3.pdf
Fichiers produits par l'(les) auteur(s)

Licence


Distributed under a Creative Commons Paternité 4.0 International License

Identifiants

  • HAL Id : hal-01571618, version 1

Collections

Citation

Joke Daems, Sally Chambers, Christophe Verbruggen, Tecle Zere. Towards a IIIF-based corpus management platform. Digital Humanities Benelux 2017, Jul 2017, Utrecht, Netherlands. 〈https://dhbenelux2017.eu/〉. 〈hal-01571618〉

Partager

Métriques

Consultations de la notice

54

Téléchargements de fichiers

15