Covid-on-the-Web: Knowledge Graph and Services to Advance COVID-19 Research

Bringing COVID-19 data to the LOD: deep and fast
In March 2020, as the coronavirus disease (COVID-19) forced us to confine ourselves at home, the Wimmics research team 1 decided to join the effort of the many scientists around the world who harness their expertise and resources to fight the pandemic and mitigate its disastrous effects. We started a new project called Covid-on-the-Web, aiming to make it easier for biomedical researchers to access, query and make sense of the COVID-19 related literature. To this end, we started to adapt, re-purpose, combine and apply tools to publish, as thoroughly and quickly as possible, a maximum of rich and actionable linked data about the coronaviruses.
In just a few weeks, we deployed several tools to analyze the COVID-19 Open Research Dataset (CORD-19) [20], which gathers 50,000+ full-text scientific articles related to the coronavirus family. On the one hand, we adapted the ACTA platform 2, designed to ease the work of clinicians in the analysis of clinical trials by automatically extracting arguments and producing graph visualizations to support decision making [13]. On the other hand, our expertise in the management of data extracted from knowledge graphs, both generic and specialized, and their integration in the HealthPredict project [9,10], allowed us to enrich the CORD-19 corpus with different sources. We used DBpedia Spotlight [6], Entity-fishing 3 and NCBO BioPortal Annotator [12] to extract Named Entities (NEs) from the CORD-19 articles, and disambiguate them against LOD resources from DBpedia, Wikidata and BioPortal ontologies. Using the Morph-xR2RML 4 platform, we turned the result into the Covid-on-the-Web RDF dataset, and we deployed a public SPARQL endpoint to serve it. Meanwhile, we integrated the Corese 5 [5] and MGExplorer [4] platforms to support the manipulation of knowledge graphs and their visualization and exploration on the Web.
By integrating these diverse tools, the Covid-on-the-Web project (sketched in Fig. 1) has designed and set up an integration pipeline facilitating the extraction and visualization of information from the CORD-19 corpus, through the production and publication of a continuously enriched linked data knowledge graph. We believe that our approach, integrating argumentation structures and named entities, is particularly relevant in today's context. Indeed, as new COVID-19 related research is published every day, results are being actively debated and numerous controversies arise (about the origin of the disease, its diagnosis, its treatment, etc.) [2]. What researchers need are tools to help them assess whether some hypotheses, treatments or explanations are indeed relevant, effective, etc. Exploiting argumentative structures while reasoning on named entities can help address these users' needs and thus reduce the number of controversies.
The rest of this paper is organized as follows. In Section 2, we explain the extraction pipeline set up to process the CORD-19 corpus and generate the RDF dataset. Then, Section 3 details the characteristics of the dataset and the services made available to exploit it. Sections 4 and 5 illustrate the current exploitation and visualization tools, and discuss future applications and the potential impact of the dataset. Section 6 reviews and compares related works.

The COVID-19 Open Research Dataset (CORD-19) [20] is a corpus gathering scholarly articles (ranging from published scientific publications to pre-prints) related to SARS-CoV-2 and previous works on the coronavirus family. CORD-19's authors processed each of the 50,000+ full-text articles, converted them to JSON documents, and cleaned up citations and bibliography links.
This section describes (Fig. 1) how we harnessed this dataset in order to (1) draw meaningful links between the articles of the CORD-19 corpus and the Web of Data by means of NEs, and (2) extract a graph of the argumentative components discovered in the articles, while respecting the Semantic Web standards. The result of this work is referred to as the Covid-on-the-Web dataset.

Building the CORD-19 Named Entities Knowledge Graph
The CORD-19 Named Entities Knowledge Graph (CORD19-NEKG), part of the Covid-on-the-Web dataset, describes NEs identified and disambiguated in the articles of the CORD-19 corpus using three tools: DBpedia Spotlight [6] 7, Entity-fishing and NCBO BioPortal Annotator [12]. Article metadata (e.g., title, authors, DOI) and content are described using DCMI 8, the FRBR-aligned Bibliographic Ontology (FaBiO) 9, the Bibliographic Ontology 10, FOAF 11 and Schema.org 12. NEs are modelled as annotations represented using the Web Annotation Vocabulary 13. An example of annotation is given in Listing 1.1. The annotation body is the URI of the resource (e.g., from Wikidata) linked to the NE. The piece of text recognized as the NE itself is the annotation target: it points to the article part wherein the NE was recognized (title, abstract or body), and locates it with start and end offsets. Provenance information is also provided for each annotation (not shown in Listing 1.1) using PROV-O 14; it denotes the source being processed, the tool used to extract the NE, the confidence in the extraction and linking of the NE, and the annotation author.
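Concretely, such an annotation can be sketched in Turtle along the following lines (the URIs, offsets and Wikidata identifier are illustrative placeholders, not taken from the actual dataset):

```turtle
# Illustrative sketch of an NE annotation; all identifiers are hypothetical.
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix ex: <http://example.org/covid/> .       # placeholder namespace

ex:annotation1 a oa:Annotation ;
    # body: the Wikidata resource the NE was disambiguated to
    oa:hasBody wd:Q82069695 ;                   # assumed QID for SARS-CoV-2
    # target: where the NE was recognized in the article
    oa:hasTarget [
        oa:hasSource <http://example.org/covid/article1#abstract> ;
        oa:hasSelector [
            a oa:TextPositionSelector ;
            oa:start 112 ;
            oa:end 122
        ]
    ] .
```

The provenance information (PROV-O) attached to each annotation follows the same annotation-centric pattern.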
Building the CORD-19 Argumentative Knowledge Graph

ACTA analyzes the abstract of each article to identify the claims stated in the trial, as well as the evidence linked to these claims, and the PICO elements. In the context of clinical trials, a claim is a concluding statement made by the author about the outcome of the study. It generally describes the relation of a new treatment (intervention arm) with respect to existing treatments (control arm). Accordingly, an observation or measurement is evidence that supports or attacks another argument component. Observations comprise side effects and the measured outcome. Two types of relations can hold between argumentative components. The attack relation holds when one component contradicts the proposition of the target component, or states that the observed effects are not statistically significant. The support relation holds for all statements or observations justifying the proposition of the target component. Each abstract of the CORD-19 corpus was analyzed by ACTA and translated into RDF to yield the CORD-19 Argumentative Knowledge Graph. The pipeline consists of four steps: (i) the detection of argumentative components, i.e. claims and evidence; (ii) the prediction of the relations holding between these components; (iii) the extraction of PICO elements; and (iv) the production of the RDF representation of the arguments and PICO elements.
Component Detection. This is a sequence tagging task where, for each word, the model predicts whether the word is part of a component or not. We combine the BERT architecture 16 [7] with an LSTM and a Conditional Random Field to perform token-level classification. The weights in BERT are initialized with specialised weights from SciBERT [1], which provides an improved representation of the language used in scientific documents such as those in CORD-19. The pre-trained model is fine-tuned on a dataset annotated with argumentative components, resulting in a 0.90 F1-score [14]. As a final step, the components are extracted from the label sequences.
Relation Classification. Determining which relations hold between the components is treated as a three-class sequence classification problem, where the sequence consists of a pair of components, and the task is to learn the relation between them, i.e. support, attack or no relation. The SciBERT transformer is used to create the numerical representation of the input text, combined with a linear layer to classify the relation. The model is fine-tuned on a dataset of argumentative relations in the medical domain, resulting in a 0.68 F1-score [14].
PICO Element Detection. We employ the same architecture as for the component detection. The model is trained on the EBM-NLP corpus [17] to jointly predict the participant, intervention 17 and outcome candidates for a given input. Here, the F1-score on the test set is 0.734 [13]. Each argumentative component is annotated with the PICO elements it contains. To facilitate structured queries, PICO elements are linked to Unified Medical Language System (UMLS) concepts with ScispaCy [16].
Argumentative knowledge graph. The CORD-19 Argumentative Knowledge Graph (CORD19-AKG) draws on the Argument Model Ontology (AMO) 18, the SIOC Argumentation Module (SIOCA) 19 and the Argument Interchange Format 20. Each argument identified by ACTA is modelled as an amo:Argument to which argumentative components (claims and evidence) are connected. The claims and evidence are themselves connected by support or attack relations (properties sioca:supports/amo:proves and sioca:challenges, respectively). Listing 1.2 sketches an example. Furthermore, the PICO elements are described as annotations of the argumentative components wherein they were identified, in a way very similar to the NEs (as exemplified in Listing 1.1). Annotation bodies are the UMLS concept identifiers (CUIs) and semantic type identifiers (TUIs).

16 BERT is a self-attentive transformer model that uses language model (LM) pre-training to learn a task-independent understanding of language from vast amounts of text in an unsupervised fashion.
17 The intervention and comparison labels are treated as one joint class.
18 http://purl.org/spar/amo/
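In the spirit of Listing 1.2, the backbone of such an argument can be sketched in Turtle as follows (the URIs are placeholders, and the property names linking an argument to its components are assumptions based on AMO):

```turtle
# Illustrative sketch of a CORD19-AKG argument; identifiers are hypothetical.
@prefix amo:   <http://purl.org/spar/amo/> .
@prefix sioca: <http://rdfs.org/sioc/argument#> .
@prefix ex:    <http://example.org/covid/> .   # placeholder namespace

ex:argument1 a amo:Argument ;
    amo:hasClaim    ex:claim1 ;      # component-linking properties assumed
    amo:hasEvidence ex:evidence1 .

# evidence supporting the claim
ex:evidence1 sioca:supports ex:claim1 ;
             amo:proves     ex:claim1 .

# an attacking component would instead use:
# ex:evidence2 sioca:challenges ex:claim1 .
```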

Automated Dataset Generation Pipeline
From a technical perspective, the CORD-19 corpus essentially consists of one JSON document per scientific article. Consequently, yielding the Covid-on-the-Web RDF dataset involves two main steps: process each document of the corpus to extract the NEs and arguments, and translate the output of both treatments into a unified, consistent RDF dataset. The whole pipeline is sketched in Fig. 1.

Named entity extraction. The extraction of NEs with DBpedia Spotlight, Entity-fishing and BioPortal Annotator produced approximately 150,000 JSON documents ranging from 100 KB to 50 MB each. These documents were loaded into a MongoDB database, and pre-processed to filter out unneeded or invalid data (e.g., invalid characters) as well as to remove NEs that are less than three characters long. Then, each document was translated into the RDF model described in Section 2.1 using Morph-xR2RML, 21 an implementation of the xR2RML mapping language [15] for MongoDB databases.

The three NE extractors were deployed on a Precision Tower 5810 equipped with a 3.7 GHz CPU and 64 GB RAM. We used Spotlight with a pre-trained model 22 and Annotator's online API 23 with the Annotator+ features, to benefit from the whole set of ontologies in BioPortal. To keep the files generated by Annotator+ at a manageable size, we disabled the negation, experiencer, temporality, display links and display context options. We enabled the longest only option, as well as the lemmatization option, to improve detection capabilities. Processing the CORD-19 corpus with the NE extractors took approximately three days. MongoDB and Morph-xR2RML were deployed on a separate machine equipped with 8 CPU cores and 48 GB RAM. The full processing, i.e., spanning the upload into MongoDB of the documents produced by the NE extractors, the pre-processing and the generation of the RDF files, took approximately three days.
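As an illustration of the translation step, a minimal xR2RML triples map producing an annotation's body from a MongoDB document could look like the fragment below (the collection name, JSON fields and template namespace are assumptions for illustration, not the project's actual mapping files):

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix xrr: <http://i3s.unice.fr/xr2rml#> .
@prefix oa:  <http://www.w3.org/ns/oa#> .

<#SpotlightAnnotations>
    xrr:logicalSource [
        # one MongoDB document per article processed by DBpedia Spotlight
        xrr:query "db.spotlight.find({})"
    ];
    rr:subjectMap [
        # hypothetical template: one annotation URI per document id
        rr:template "http://example.org/covid/annotation/{$._id}"
    ];
    rr:predicateObjectMap [
        rr:predicate oa:hasBody;
        # hypothetical JSON field holding the disambiguated entity URI
        rr:objectMap [ xrr:reference "$.entityUri"; rr:termType rr:IRI ]
    ] .
```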
Argumentative graph extraction. Only the abstracts longer than ten sub-word tokens 24 were processed by ACTA to ensure meaningful results. In total, almost 30,000 documents matched this criterion. ACTA was deployed on a 2.8 GHz dual-Xeon node with 96 GB RAM, and processing the articles took 14 hours. As in the NE extraction, the output JSON documents were loaded into MongoDB and translated to the RDF model described in Section 2.2 using Morph-xR2RML. The translation to RDF was carried out on the same machine as above, and took approximately 10 minutes.

Publishing and Querying Covid-on-the-Web Dataset
The Covid-on-the-Web dataset consists of two main RDF graphs, namely the CORD-19 Named Entities Knowledge Graph and the CORD-19 Argumentative Knowledge Graph. A third, transversal graph describes the metadata and content of the CORD-19 articles. Table 1 synthesizes the amount of data at stake in terms of JSON documents and RDF triples produced. Table 2 reports some statistics on the different vocabularies used.
Dataset Description. In line with common data publication best practices [8], we paid particular attention to the thorough description of the Covid-on-the-Web dataset itself. This notably comprises (1) licensing, authorship and provenance information described with DCAT 25, and (2) vocabularies, interlinking and access information described with VOID 26. The interested reader may look up the dataset URI 27 to visualize this information.
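For illustration, such a self-description typically takes a form along these lines (all URIs and values below are placeholders, not the dataset's actual description):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .

<http://example.org/covid/dataset>     # placeholder dataset URI
    a dcat:Dataset, void:Dataset ;
    dct:title "Covid-on-the-Web dataset" ;
    dct:license <http://example.org/covid/license> ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:vocabulary <http://www.w3.org/ns/oa#> ,
                    <http://purl.org/spar/amo/> .
```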
Reproducibility. In compliance with open science principles, all the scripts, configuration and mapping files involved in the pipeline are provided in the project's GitHub repository under the terms of the Apache License 2.0, so that anyone may rerun the whole processing pipeline (from article mining to loading the RDF files into Virtuoso OS).
Dataset Licensing. Being derived from the CORD-19 dataset, different licences apply to the different subsets of the Covid-on-the-Web dataset. The subset corresponding to the CORD-19 dataset translated into RDF (including article metadata and textual content) is published under the terms of the CORD-19 license. 28 In particular, this license respects the sources that are copyrighted. The subset produced by mining the articles, either the NEs (CORD19-NEKG) or the argumentative components (CORD19-AKG), is published under the terms of the Open Data Commons Attribution License 1.0 (ODC-By). 29

Sustainability Plan. In today's context, where new research about COVID-19 is published weekly, the value of Covid-on-the-Web, as well as of other related datasets, lies in the ability to keep up with the latest advances and ingest new data as it is being published. Towards this goal, we have taken care to produce a documented, repeatable pipeline, and we have already performed such an update, thus validating the procedure. In the middle term, we intend to improve the update frequency while considering (1) the improvements delivered by CORD-19 updates, and (2) the changing needs to be addressed based on the expression of new application scenarios (see Section 5). Furthermore, we have deployed a server to host the SPARQL endpoint that benefits from a high-availability infrastructure and 24/7 support.

Visualization and Current Usage of the Dataset
Beyond the production of the Covid-on-the-Web dataset, our project has set out to explore ways of visualizing and interacting with the data. We have developed a tool named Covid Linked Data Visualizer 30, comprising a query web interface hosted by a node.js server, a transformation engine based on the Corese Semantic Web factory [5], and the MGExplorer graphic library [4]. The web interface enables users to load predefined SPARQL queries or edit their own queries, and execute them against our public SPARQL endpoint. The queries are parameterized by HTML forms by means of which the user can specify search criteria, e.g., the publication date. The transformation engine converts the JSON-based SPARQL results into the JSON format expected by the graphic library. Exploration of the result graph is then supported by MGExplorer, which encompasses a set of specialized visualization techniques, each of them focusing on a particular type of relationship. Fig. 2 illustrates some of these techniques: the node-edge diagram (left) shows an overview of all the nodes and their relationships; ClusterVis (top right) is a cluster-based visualization allowing the comparison of node attributes while keeping the representation of the relationships among them; IRIS (bottom right) is an egocentric view displaying all the attributes and relations of a particular node. The proposed use of information visualization techniques is original in that it provides users with interaction modes that can help them explore, classify and analyse the importance of publications. This is a key point for making the tools usable and accessible, and for fostering adoption.
During a meeting with health and medical research organisations (namely Inserm and INCa), an expert provided us with an example of a query that researchers would be interested in answering against a dataset like the one we generated: "find the articles that mention both a type of cancer and a virus of the corona family". Taking that query as a first competency question, we used the Covid Linked Data Visualizer, whose results are visualized with the MGExplorer library (Fig. 2). We also created several Python and R Jupyter notebooks 31 to demonstrate the transformation of the results into structures such as Dataframes 32 for further analysis (Fig. 3).
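In SPARQL, this first competency question can be sketched as follows (the property path from an annotation to its article and the Wikidata identifiers are assumptions to be adjusted to the actual dataset schema; in practice, the taxonomic closure would likely be computed via a federated SERVICE clause against the Wikidata endpoint):

```sparql
PREFIX oa:   <http://www.w3.org/ns/oa#>
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?article WHERE {
  # an NE disambiguated as some type of cancer
  ?a1 a oa:Annotation ;
      oa:hasTarget/oa:hasSource ?article ;   # assumed path to the article
      oa:hasBody ?cancer .
  ?cancer wdt:P279* wd:Q12078 .              # subclass of cancer (QID assumed)

  # an NE disambiguated as a virus of the corona family
  ?a2 a oa:Annotation ;
      oa:hasTarget/oa:hasSource ?article ;
      oa:hasBody ?virus .
  ?virus wdt:P171* ?family .                 # parent taxon chain
  ?family rdfs:label "Coronaviridae"@en .    # the coronavirus family
}
```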
Let us finally mention that, beyond our own uses, the Covid-on-the-Web dataset is now served by the LOD Cloud cache hosted by OpenLink Software. 33

Potential Impact and Reusability
To the best of our knowledge, the Covid-on-the-Web dataset is the first one integrating NEs, arguments and PICO elements into a single, coherent whole. We are confident that it will serve as a foundation for Semantic Web applications as well as for benchmarking algorithms, and that it will be used in challenges. The resources and services that we offer on the COVID-19 literature are of interest for health organisations and institutions to extract and intelligently analyse information on a disease which is still relatively unknown and for which research is constantly evolving. To a certain extent, it is possible to cross-reference information to better understand this matter and, in particular, to initiate research into unexplored paths. We also hope that the openness of the data and code will allow contributors to advance the current state of knowledge on this disease, which is impacting society worldwide. In addition to being interoperable with central knowledge graphs used within the Semantic Web community, the visualizations we offer through MGExplorer and notebooks show the potential of these technologies in other fields, e.g., the biomedical and medical ones.
Interest of communities in using the Dataset and Services. Several biomedical institutions have shown interest in using our resources, either direct project partners (French Institute of Medical Research, Inserm; French National Cancer Institute, INCa) or indirect ones (Antibes Hospital, Nice Hospital). For now, these institutions act as potential users of the resources, and as co-designers. Furthermore, given the importance of the issues at stake and the strong support that they can provide in dealing with them, we believe that other similar institutions could be interested in using the resources.
Documentation/Tutorials. For design rationale purposes, we keep records of the methodological documents used during the design of the resources (e.g., query elicitation documents), the technical documentation of the algorithms and models 34, the best practices we follow (FAIR, Cool URIs, five-star linked data, etc.) and the end-user help (e.g., demonstration notebooks).
Application scenarios, user models, and typical queries. Our resources are based on generic tools that we are adapting to the COVID-19 issue. More precisely, having a user-oriented approach, we are designing them according to three main motivating scenarios identified through a needs analysis of the biomedical institutions with whom we collaborate.
Scenario 1: Helping clinicians to get argumentative graphs to analyze clinical trials and make evidence-based decisions.
Scenario 2: Helping hospital physicians to collect reference ranges of substances in the human organism (e.g., cholesterol) from scientific articles, to determine whether their patients' substance levels are normal or not.
Scenario 3: Helping mission heads from a Cancer Institute to collect scientific articles about cancer and coronaviruses, in order to elaborate research programs studying the link between cancer and coronaviruses in more depth.
The genericity of the basic tools will allow us to later apply the resources to a wider set of scenarios, and our biomedical partners already urge us to start thinking of scenarios related to issues other than COVID-19.
Besides the scenarios above, we are also eliciting representative user models (in the form of personas), the aim of which is to help us, as service designers, understand our users' needs, experiences, behaviors and goals.
We also elicited meaningful queries from the potential users we interviewed. These queries serve to specify and test our knowledge graph and services. For genericity purposes, we elaborated a typology from the collected queries, using dimensions such as Prospective vs. Retrospective queries, or Descriptive (requests for description) vs. Explanatory (requests for explanation) queries. These queries are a brief illustration of an actual (yet non-exhaustive) list of questions raised by users. It is worth noting that whilst some questions might be answered by showing the correlation between components (e.g., types of cancer), others require the representation of trends (e.g., cancers likely to occur in the next years) or the analysis of specific attributes (e.g., details about metabolic changes caused by the disease). Answering these complex queries requires exploring the CORD-19 corpus, and for that we offer a variety of analysis and visualization tools. These queries and the generic typology shall be reused in further extensions and other projects.

The Covid Linked Data Visualizer (presented in Section 4) supports the visual exploration of the Covid-on-the-Web dataset. Users can inspect the attributes of the elements in the graph resulting from a query (by positioning the mouse over elements) or launch a chained visualization using any of the interaction techniques available (e.g., IRIS, ClusterVis). These visualization techniques are meant to help users understand the relationships present in the results. For example, users can run a query to visualize a co-authorship network, then use IRIS and ClusterVis to understand who is working together and on which topics. They can also run a query looking for papers mentioning COVID-19 and diverse types of cancer. Finally, the advanced mode makes it possible to add new SPARQL queries implementing other data exploration chains.
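For instance, the co-authorship exploration just mentioned can be backed by a query along these lines (the vocabulary choices are assumptions derived from the metadata vocabularies listed in Section 2.1):

```sparql
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# pairs of authors who co-signed articles (schema assumed)
SELECT ?name1 ?name2 (COUNT(DISTINCT ?article) AS ?papers) WHERE {
  ?article dct:creator ?a1, ?a2 .
  ?a1 foaf:name ?name1 .
  ?a2 foaf:name ?name2 .
  FILTER (STR(?name1) < STR(?name2))   # avoid symmetric duplicates
}
GROUP BY ?name1 ?name2
ORDER BY DESC(?papers)
```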

Related Works
Since the first release of the CORD-19 corpus, multiple initiatives, ranging from quick-and-dirty data releases to the repurposing of existing large projects, have started analyzing and mining it with different tools and for different purposes. Entity linking is usually the first step towards further processing or enrichment. Hence, not surprisingly, several initiatives have already applied these techniques to the CORD-19 corpus. CORD-19-on-FHIR 35 results from the translation of the CORD-19 corpus into RDF following the HL7-FHIR interchange format, and the annotation of articles with concepts related to conditions, medications and procedures. The authors also used Pubtator [21] to further enrich the corpus with concepts such as genes, diseases, chemicals, species, mutations and cell lines. KG-COVID-19 36 seeks the lightweight construction of KGs for COVID-19 drug repurposing efforts. The KG is built by processing the CORD-19 corpus and adding NEs extracted from COVIDScholar.org and mapped to terms from biomedical ontologies. Covid19-PubAnnotation 37 is a repository of text annotations concerning CORD-19 as well as LitCovid and others. Annotations are aggregated from multiple sources and aligned to the canonical text taken from PubMed and PMC. The Machine Reading for COVID-19 and Alzheimer's 38 project aims at producing a KG representing causal inference extracted from semantic relationships between entities such as drugs, biomarkers or comorbidities. The relationships were extracted from the Semantic MEDLINE database enriched with CORD-19. CKG-COVID-19 39 seeks the discovery of drug repurposing hypotheses through link prediction. It processed the CORD-19 corpus with state-of-the-art machine reading systems to build a KG where entities such as genes, proteins, drugs and diseases are linked to their Wikidata counterparts.

35 https://github.com/fhircat/CORD-19-on-FHIR
36 https://github.com/Knowledge-Graph-Hub/kg-covid-19/
37 https://covid19.pubannotation.org/
38 https://github.com/kingfish777/COVID19
When comparing Covid-on-the-Web with these other initiatives, several main differences can be pointed out. First, they restrict processing to the title and abstract of the articles, whereas we process the full text of the articles with Entity-fishing, thus providing a high number of NEs linked to Wikidata concepts. Second, these initiatives mostly focus on biomedical ontologies. As a result, the NEs identified typically pertain to genes, proteins, drugs, diseases, phenotypes and publications. In our approach, we have not only considered biomedical ontologies from BioPortal, but we have also extended this scope with two general knowledge bases that are major hubs in the Web of Data: DBpedia and Wikidata. Finally, to the best of our knowledge, our approach is the only one to integrate argumentation structures and named entities in a coherent dataset.
Argument(ation) Mining (AM) [3] is the research area aiming at extracting and classifying argumentative structures from text. AM methods have been applied to heterogeneous types of textual documents. However, only a few approaches [22,11,14] have focused on automatically detecting argumentative structures in textual documents from the medical domain, e.g., clinical trials, guidelines and Electronic Health Records. Recently, transformer-based contextualized word embeddings have been applied to AM tasks [18,14]. To the best of our knowledge, Covid-on-the-Web is the first attempt to apply AM to the COVID-19 literature.

Conclusion and Future Works
In this paper, we described the data and software resources provided by the Covid-on-the-Web project. We adapted and combined tools to process, analyze and enrich the CORD-19 corpus, to make it easier for biomedical researchers to access, query and make sense of COVID-19 related literature. We designed and published a linked data knowledge graph describing the named entities mentioned in the CORD-19 articles and the argumentative graphs they include. We also published the pipeline we set up to generate this knowledge graph, in order to (1) continue enriching it and (2) spur and facilitate reuse and adaptation of both the dataset and the pipeline. On top of this knowledge graph, we developed, adapted and deployed several tools providing Linked Data visualizations, exploration methods and notebooks for data scientists. Through active interactions (interviews, observations, user tests) with institutes in healthcare and medical research, we are ensuring that our approach is guided by and aligned with the actual needs of the biomedical community. We have shown that with our dataset, we can perform documentary research and provide visualizations suited to the needs of experts. Great care has been taken to produce datasets and software that meet the open and reproducible science goals and the FAIR principles.
We identified that, since the emergence of COVID-19, the unusual pace at which new research has been published and knowledge bases have evolved raises critical challenges. For instance, a new release of CORD-19 is published weekly, which challenges the ability to keep up with the latest advances. Also, the extraction and disambiguation of NEs was achieved with pre-trained models produced before the pandemic, typically before the SARS-CoV-2 entity was even created in Wikidata. Similarly, it is likely that existing terminological resources are being, or will soon be, released with COVID-19 related updates. Therefore, in the middle term, we intend to engage in a sustainability plan aiming to routinely ingest new data and monitor knowledge base evolution so as to reuse updated models. Furthermore, since there is no reference CORD-19 subset that has been manually annotated and could serve as ground truth, it is hardly possible to evaluate the quality of the machine learning models used to extract named entities and argumentative structures. To address this issue, we are currently working on the implementation of data curation techniques, and on the automated discovery of frequent patterns and association rules that could be used to detect mistakes in the extraction of named entities, thus allowing us to come up with quality enforcing measures.