Seminar: defoe: A Spark-based Toolbox for Analysing Digital Historical Textual Data
In this talk, I will present defoe, a new scalable and portable digital toolbox that enables historical research. It allows for extracting knowledge from historical data by running text analyses in parallel across large digital collections, such as historical newspapers and books. It offers a rich set of text mining queries, which have been defined by humanities researchers. We have included NLP preprocessing techniques to mitigate OCR errors and standardise the textual data. We have tested defoe using six different large-scale historical text datasets and three HPC environments, as well as on desktops.
- Date: October 4, 2019
- Time: 11am PT / 2pm ET
- Location: 6th floor large Conference Room #689, Information Sciences Institute, Marina del Rey, CA, USA
Bio
Rosa Filgueira is a research fellow at EPCC (University of Edinburgh), working on several nationally and internationally funded projects. Before that, she worked as a Senior Data Scientist at the British Geological Survey, as a Senior Research Associate at the Data Intensive Research Group of the University of Edinburgh, and as a Research and Teaching Assistant at the Computer Architecture Group of University Carlos III of Madrid. Her research is concerned with two closely related topics. The first is to develop adaptive communication techniques that optimise data movement for data-intensive applications at different HPC levels. The second is to facilitate the development of scientific workflows and applications that can be run in many HPC environments while hiding the complexity from users.