Performing and Visualizing Temporal Analysis of Large Text Data Issued for Open Sources: Past and Future Methods

In this paper we first present a state of the art on the methods for the visualization and the interpretation of textual data, in particular of scientific data. We then briefly present our contributions to this field in the form of original methods for the automatic classification of documents and easy interpretation of their content through characteristic keywords and classes created by our algorithms. In a second step, we focus our analysis on data evolving over time. We detail our diachronic approach, especially suitable for the detection and visualization of topic changes. This allows us to conclude with Diachronic'Explorer, our upcoming tool for visual exploration of evolutionary data.


Introduction
Databases of scientific literature and patents provide large volumes of data for the study of scientific production. These data are rich and therefore complex. Indeed, the textual content of the publications, the keywords used for archiving them in these databases, the citations they contain and the affiliations of their authors are all information that can be exploited to study corpora of publications. These corpora are therefore a boon for the analysis of scientific and technical information.
In this article, we focus on one major concern: identifying the main changes related to developments in science and describing them in a textual and visual way. Indeed, monitoring the development of transversal themes, as well as detecting emerging themes or bridges between themes, allows researchers to ensure the innovative character of their area of research.
Furthermore, in the management of research funding by the European Commission (EC), the detection of emerging issues is fundamental, as shown by the following examples. The NEST (New and Emerging Science and Technology) program was a specific EC program in FP6. Its objective was to encourage visionary and unconventional research at the frontiers of knowledge and at the interface of disciplines. To organize this program, the EC launched a call for support actions to follow and evaluate the projects, but also to identify future research opportunities. Similarly, alongside the thematic ICT (Information & Communication Technologies) program, the European Commission set up, in the 7th Framework Program (FP7), the FET (Future and Emerging Technologies) program to promote long-term or high-risk research with potentially strong societal or industrial impact.
The detection of emerging technologies remains a complex process, and is therefore subject to studies in a broad spectrum of areas ranging from marketing to bibliometrics.
The selection tree proposed by [1] gives a good overview of the forecasting methods that can be applied, in particular for the detection of these emerging technologies. It clearly illustrates the dichotomy between quantitative methods and those based on expertise, and shows the great diversity of existing approaches: the Delphi and Nominal Group techniques, which are based on confronting the opinions of experts; scenario methods, designed to scan different possible futures [20]; and methods combining the knowledge of the experts of a field with statistical techniques, allowing the identification of trends affecting causal factors.
The size and complexity of the data available in databases of scientific publications require the development of quantitative methods for the detection of emerging topics. These bibliometric methods apply relatively simple statistical techniques, such as growth curves, or more sophisticated ones, such as automatic classification and network analysis [21] [4] [19] [13]. Another concern is to provide tools capable of producing outputs exploitable by the end user. These outputs should be descriptive, intelligible and viewable.
Therefore, we separate our analysis into two parts. In the first part, we describe the quantitative and automatic methods that allow the extraction of relevant information from a corpus. In particular, these techniques detect characteristic keywords from documents, or the underlying topics referred to in the documents together with the keywords that describe them. We also discuss the visual exploitation that can be made of these methods. In the second part, we detail the methods for studying topic changes within a corpus whose data are anchored in time. We focus in particular on diachronic analysis methods, which are particularly effective for tracking these changes step by step and in a synthetic way.
Finally, in the last part, we detail Diachronic'Explorer, our open source tool for the production and viewing of diachronic analysis results. We show the effectiveness of the extraction methods that we propose through complementary and dynamic visualizations using recent technologies.

Topic identification is a technique that consists in understanding the meaning of the content of the documents of a corpus in an unsupervised way (without prior knowledge of the corpus and without human intervention) by isolating the topics underlying this content. These topics are usually represented by coordinated phrases and are often ranked in order of importance in the documents. As shown in [5], many techniques can be applied for topic identification, and they may exploit research from several different communities, such as data mining, computational linguistics and information retrieval. We present hereafter two different types of identification techniques and their related visualization tools: the first is a widely used state-of-the-art technique, the second is an alternative technique that we propose.

LDA
The method. The LDA method is a probabilistic method for topic extraction that considers that the underlying topics of a corpus of documents can be characterized by multinomial distributions over the words present in the documents [2]. According to this principle, each document is considered as a composition of the topics extracted from the studied corpus. Figure 1 presents a list of topics produced by LDA and their manifestation in a document. LDA uses a Dirichlet distribution to allow a careful choice of the parameters of the multinomial distributions. In practice, however, the estimation of these parameters is complex and costly in computation time. It requires expectation-maximization algorithms [8], which are prone to produce sub-optimal solutions, and in particular trivial or overly general results that are not usable in many cases, as in the context of diachronic analysis (Section 3). This last type of analysis, which aims at comparing topics evolving over time, indeed requires accurate topic descriptions to isolate changes. Finally, the importance of the words in the documents can only be estimated indirectly by LDA, and the method does not work on isolated documents. We present later a method that we have developed, based on the feature maximization metric, which does not have these drawbacks.
Visualization using LDA. In LDAExplore [6], the authors use a Treemap (Fig. 2) to represent the distribution of the importance of keywords in the topics learned by LDA. The representation of the topics in each document is also displayed, in the form of curves in which the x-coordinate of each point is a topic and the y-coordinate is its weight. Guille and Morales offer a complete library for topic modelling, including LDA, which can also produce visualizations in the form of word clouds or histograms [11].
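For reference, the generative assumption behind LDA described above is usually written as follows. The notation is the conventional one from the topic modelling literature and is an assumption here, since it is not defined in this paper: $K$ topics, $D$ documents, Dirichlet hyperparameters $\alpha$ and $\beta$, and topic assignments $z$. The smoothed variant, with a Dirichlet prior on the topic-word distributions, is assumed.

```latex
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),          & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha),         & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_d),     & n &= 1, \dots, N_d \\
w_{d,n}  &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{align*}
```

Here $\phi_k$ is the word distribution of topic $k$, $\theta_d$ is the topic mixture of document $d$, and $w_{d,n}$ is the $n$-th word of document $d$. Only the words $w_{d,n}$ are observed; all other quantities have to be inferred, which is where the computational cost mentioned above comes from.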

Feature maximization for feature selection
The method. To introduce the feature maximization metric and process [16], we first use an example. We then explain its use in our application framework. In Table 3, we present sample data collected from a panel of Men (M) and Women (W) described by three features: Nose Size (N), Hair Length (C) and Shoes Size (S). The problem of supervised classification in computer science is to learn to discriminate the class of Men from the class of Women automatically by using these features. To achieve this, it is worthwhile for the algorithms to exploit the features that best separate the Men from the Women.
The process of feature maximization is comparable to a feature selection process. This process is based on the feature F-measure. The feature F-measure $FF_c(f)$ of a feature $f$ associated with a cluster $c$ (M or W in our example) is defined as the harmonic mean of the feature recall $FR_c(f)$ and of the feature predominance $FP_c(f)$:

$$FF_c(f) = \frac{2\, FR_c(f)\, FP_c(f)}{FR_c(f) + FP_c(f)}$$

with

$$FR_c(f) = \frac{\sum_{d \in c} W_d^f}{\sum_{c' \in C} \sum_{d \in c'} W_d^f}, \qquad FP_c(f) = \frac{\sum_{d \in c} W_d^f}{\sum_{f' \in F_c} \sum_{d \in c} W_d^{f'}}$$

where $W_d^f$ represents the weight of the feature $f$ for the data $d$, $C$ is the set of classes and $F_c$ represents all the features present in the dataset associated with the class $c$.
The feature selection process based on feature maximization can thus be defined as a parameter-free process in which a class feature is characterized by both its ability to discriminate the class to which it relates ($FR_c(f)$ index) and its ability to faithfully represent the data of this class ($FP_c(f)$ index). Table 4 shows how the feature F-measure of the Shoes Size feature is calculated for the Men class.
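As an illustration of the definitions above, the following minimal Python sketch computes the feature recall, feature predominance and feature F-measure from weighted data. The function name `feature_f_measure` and the toy weights are hypothetical: the values are arbitrary and are not those of the tables of this paper.

```python
from collections import defaultdict


def feature_f_measure(data, labels):
    """Return {class: {feature: (FR, FP, FF)}}.

    data   : list of dicts mapping a feature name to a non-negative weight
    labels : list of class labels, one per data item
    """
    # Sum of the weights of each feature inside each class.
    class_feature_sum = defaultdict(lambda: defaultdict(float))
    for item, cls in zip(data, labels):
        for feature, weight in item.items():
            class_feature_sum[cls][feature] += weight

    # Sum of the weights of each feature over all classes (denominator of FR).
    feature_total = defaultdict(float)
    for sums in class_feature_sum.values():
        for feature, value in sums.items():
            feature_total[feature] += value

    scores = {}
    for cls, sums in class_feature_sum.items():
        class_total = sum(sums.values())  # denominator of FP
        scores[cls] = {}
        for feature, value in sums.items():
            fr = value / feature_total[feature] if feature_total[feature] else 0.0
            fp = value / class_total if class_total else 0.0
            # Harmonic mean of recall and predominance.
            ff = 2 * fr * fp / (fr + fp) if fr + fp else 0.0
            scores[cls][feature] = (fr, fp, ff)
    return scores


# Hypothetical toy data: two items in class "M", one item in class "W".
toy_data = [
    {"shoes_size": 3, "hair_length": 1},
    {"shoes_size": 1, "hair_length": 1},
    {"shoes_size": 1, "hair_length": 3},
]
toy_labels = ["M", "M", "W"]

scores = feature_f_measure(toy_data, toy_labels)
for cls, per_feature in scores.items():
    for feature, (fr, fp, ff) in per_feature.items():
        print(f"{cls} {feature}: FR={fr:.2f} FP={fp:.2f} FF={ff:.2f}")
```

On this toy input, the shoes size feature obtains FR = 4/5 and FP = 4/6 in class "M", which gives a feature F-measure of 8/11.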
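Once the capacity to discriminate a class ($FR_c(f)$ index) and to faithfully represent the data of a class ($FP_c(f)$ index) has been calculated for each feature (Tab. 5), the next step consists in automatically selecting the most relevant features for distinguishing classes. The set $S_c$ of features that are characteristic of a given class $c$ belonging to the group of classes $C$ is represented as:

$$S_c = \left\{ f \in F_c \;\middle|\; FF_c(f) > \overline{FF}(f) \ \text{and} \ FF_c(f) > \overline{FF}_D \right\}$$

with

$$\overline{FF}(f) = \frac{\sum_{c' \in C} FF_{c'}(f)}{|C_{/f}|} \quad \text{and} \quad \overline{FF}_D = \frac{\sum_{f \in F} \overline{FF}(f)}{|F|}$$

where $C_{/f}$ represents the subset of $C$ in which the feature $f$ is represented.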
Finally, the set of all selected features $S_C$ is the subset of $F$ defined by:

$$S_C = \bigcup_{c \in C} S_c$$

The features that are considered relevant for a given class are those whose representation is better in that class than their average representation in all classes, and also better than the average representation of all features, with regard to the feature F-measure. Thus, features whose feature F-measure is always lower than the overall average are eliminated, and the Nose Size variable is therefore discarded in our example (0.3 < 0.38 and 0.24 < 0.38).
A complementary step of contrast estimation may be carried out in addition to this first selection stage. The role of this step is to estimate the information gain produced by a feature on a class. It is proportional to the ratio between the value of the F-measure of a feature in the class and the average value of the F-measure of that feature in all classes. For a feature $f$ belonging to the group of selected features $S_c$ of a class $c$, the gain $G_c(f)$ is expressed as:

$$G_c(f) = \frac{FF_c(f)}{\overline{FF}(f)}$$

where $\overline{FF}(f)$ is the average feature F-measure of $f$ over the classes in which it is represented. Finally, the active, or descriptive, features of a class are those, among the selected features, for which the contrast is greater than 1. Thus, the selected features are considered active in the classes in which their feature F-measure is higher than the marginal average:
- Shoes Size is active in the Men class (0.48 > 0.35),
- Hair Length is active in the Women class (0.66 > 0.53).
The contrast ratio highlights the degree of activity or passivity of the selected features compared to their average F-measure. Table 6 shows how the contrast is calculated on the presented example. In this context, the contrast may thus be considered as a function that will virtually have the following effects:
- increase the length of women's hair,
- increase the size of men's shoes,
- reduce the length of men's hair,
- decrease the size of women's shoes.
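The selection and contrast steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_and_contrast` is hypothetical, the contrast is taken as the plain ratio between the class F-measure and the marginal average, and the input table mixes the values quoted in the text (0.48, 0.66, 0.3, 0.24) with values that are assumed here (0.22, 0.40, and the assignment of 0.3 and 0.24 to classes), chosen so that the marginal averages match the 0.35 and 0.53 quoted above.

```python
def select_and_contrast(ff):
    """Select class features and compute their contrast.

    ff : {class: {feature: feature F-measure}}
    Returns {class: {selected feature: contrast}}.
    """
    features = sorted({f for per_class in ff.values() for f in per_class})

    # Marginal average of each feature over the classes where it occurs.
    marginal = {}
    for f in features:
        values = [per_class[f] for per_class in ff.values() if f in per_class]
        marginal[f] = sum(values) / len(values)

    # Overall average of the marginal averages.
    overall = sum(marginal.values()) / len(marginal)

    selected = {}
    for cls, per_class in ff.items():
        selected[cls] = {
            # Contrast: class F-measure relative to the marginal average.
            f: value / marginal[f]
            for f, value in per_class.items()
            if value > marginal[f] and value > overall
        }
    return selected


# Partly assumed F-measure table (see the paragraph above).
ff_table = {
    "M": {"shoes_size": 0.48, "hair_length": 0.40, "nose_size": 0.30},
    "W": {"shoes_size": 0.22, "hair_length": 0.66, "nose_size": 0.24},
}

selected = select_and_contrast(ff_table)
for cls, contrasts in selected.items():
    for feature, contrast in contrasts.items():
        print(f"{cls}: {feature} is active, contrast={contrast:.2f}")
```

With these inputs, only the shoes size is retained for class "M" and only the hair length for class "W", and the nose size is discarded in both classes. Since a selected feature is by construction above its marginal average, its contrast under this definition is always greater than 1.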
Preliminary cluster labelling experiments showed that the feature maximization metric has discrimination capabilities similar to the Chi-squared metric, with appreciably higher generalization capabilities [14]. Moreover, this technique proved to have a very low computation time, unlike LDA. It often has a dual function in learning and visualization, as shown by the experiments in [7] or in [16]. In the classification context, it can thus optimize the performance of the classifiers while producing class profiles exploitable for the visualization of the content of the classes. Table 7 shows an application of the method to textual data, with the goal of establishing discriminative profiles of Presidents Chirac and Mitterrand using data from the DEFT 2006 corpus, which contains around 80,000 extracts of their speeches [17]. It shows in particular that contrast makes it possible to quantify the influence of features in classes (typical terms of each speaker). Extracted features and their contrast can thus act as class profiles for a classifier, as well as indicators of content or meaning for an analyst.
This first example shows how to use feature maximization in the general framework of text datasets. We show more specifically in Section 3 how our feature maximization method can be applied to the more precise and difficult task of detecting topic changes in corpora of scientific papers, and we describe in Section 4 advanced visualizations of those changes. But, to clearly illustrate the potential and the scope of the method, we first give hereafter a more specific example, related to the synthetic visualization and automatic summarization of the individual content of documents with this method.
Visualization using feature maximization. For the synthetic representation of the content of a single document, we propose an original method based on competition between blocks of text, coupled with the feature maximization metric. This approach makes it possible to overcome the lack of metadata describing the texts. Furthermore, it has the advantage of being independent of the language and of working without external knowledge sources or parameters, and it is likely to have multiple applications: generation of metadata or input data for clustering, and generation of automatic summaries and explanations at several levels of generality. It consists here in indexing full-text papers by taking advantage of their structure. In this approach, each part of the paper can be seen as a class, the paper itself being a classification: for example, the exploited classes might be "introduction, methodology, state of the art, results, conclusion...".
The paper used as an example includes the following major parts: Introduction / Methods / Results / Discussion (Fig. 8). After extraction of the terms by a conventional PoS tagging method, the feature maximization method described in Section 2.2 yields a list of specific terms for each part of the paper, weighted by their importance. From that, it is possible to build a vectorial (i.e. Bag-of-Words) representation of the paper or, alternatively, a weighted graph (paper parts/selected terms) that clearly illustrates the scientific content of the text (Fig. 9).
Footnote 5. The ISTEX project (Initiative d'Excellence pour l'Information Scientifique et Technique) is part of the "Investment for the future" program, initiated by the French Ministry of Higher Education and Research (MESR), whose ambition is to strengthen French research and higher education at the world level. The main objective of the ISTEX project is to offer the whole higher education and research community online access to retrospective collections of scientific literature in all disciplines, through a national policy of massive acquisition of documentation: archives of journals, databases, corpora of texts. Reference: http://www.istex.fr
If we follow an automatic summarization approach [10], each selected term being weighted for each identified part of the paper, it is easy to weight the sentences containing these terms by summing the term weights. We furthermore assign an additional weight to terms that are also part of the title of the paper. The curve of the sentence weights thus calculated always shows a plateau (Fig. 10); we then choose to keep the sentences whose weight is greater than or equal to this level and reorder them by rank of appearance in the text. We thus obtain a summary, built by extraction of meaningful sentences from the text, that is generally small in size (fewer than 12 sentences in all our experiments). For the paper used as an example, the summary is shown in Figure 11.

Visualization of evolving data

Visualization methods
In Neviewer, Wang et al. use alluvial diagrams (Fig. 12), sometimes called Sankey visualization [28].These visualizations were also used in [27] to view the changes in the citations between scientific disciplines.
Ratinaud uses a dendrogram to visualize the various topics (and their vicinities) mentioned on Twitter with the hashtag #mariagepourtous [26]. It is nonetheless Treemaps (Section 2, Fig. 2) that enable him to show the progression or regression of topics over time. For their part, Osborne and Motta use stacked area charts to follow the evolution of the number of publications grouped into topics across time [22].
These methods all have interesting benefits, and we show their complementary use in Section 4.

Diachronic analysis
Diachronic analysis, which consists in comparing data or results by time step, is extensively used. In linguistics, Perea uses this technique to follow the evolution of the Catalan language through time [25]. Cardon et al. study the evolution of blogs and their importance on the web in a diachronic way [3]. Similarly, the activity of bloggers and the evolution of their interests are studied in a diachronic way by [12]. The work of Wang et al., more closely connected to our field of application, analyzes the thematic evolution of research in such a way [28].
In our case, we develop a parameter-free method, directly exploitable by the user and based on feature maximization [16], to identify and describe the topics of a corpus. This approach allows the identification of keywords, drawn from the full-text content of documents, that are characteristic of each topic, in contrast to methods based on keywords indexed by the publication databases [28] [23]. Furthermore, the absence of parameters in the process makes it possible to completely automate the task, from the indexing of the documents, through the detection of topics, up to the visualization of their evolution. Figure 13 shows the progress of the complete method up to visualization. The whole process is also detailed hereafter:
1. We query a bibliographic database in order to build a corpus covering several years of publication on a given topic.
2. The full text of each document obtained is processed with a conventional PoS tagging tool to extract index terms (keywords).
3. The documents and their extracted keywords are then grouped into classes corresponding to their year of publication, and a feature selection based on feature maximization is applied to the keywords of each document group. Furthermore, a graph representing the interactions between keywords and document groups, using weighted links that set the strength of the relationships, is constructed. Thanks to a random walk algorithm (here Walktrap [24]), it is then possible to automatically detect groups of years (time periods) that will then serve as time steps for the diachronic clustering algorithm. Figure 14 gives an example of a contrast graph as well as the resulting division into time periods.
4. A neural clustering algorithm [9], more stable and more efficient than the usual clustering algorithms on textual data, is applied multiple times, with standard parameters, to the data of each time period by varying the desired number of clusters. Clustering quality criteria that are reliable for textual and multidimensional data are exploited in a further step [18] to isolate an optimal model (ideal number of clusters) for each of the periods.
5. The optimal clustering models of each period are post-processed separately using feature maximization so as to extract the salient features of each cluster in each time period.
6. The feature maximization results, as well as the overall clustering results, are transmitted to the diachronic module of the Diachronic'Explorer tool presented in Section 4. This tool implements both diachronic analysis functions, based on unsupervised Bayesian reasoning [15], in order to detect thematic connections and differences between the time periods, and many functions for visualizing the results.

Diachronic'Explorer tool
We now introduce Diachronic'Explorer, our open source tool for the production and visual exploitation of diachronic analysis results (a demo version of the tool can be found at http://github.com/nicolasdugue/istex-demonstrateur). Diachronic'Explorer is composed of two modules. The first module uses the descriptions of the topics produced by the clustering and the feature maximization (Section 3.2) to track these topics over time and to detect the changes and similarities between time periods. The second module is dedicated to the visualization of the results produced by the first. Designed as a web platform using modern technology, the tool offers the possibility to explore the corpus through various complementary visualizations, each representing a different level of granularity in the exploration of this corpus.
We detail below how the tool can be used, from the finest-grained level to more synthetic visualizations of the corpus and its evolution. For that purpose, we take as an example the evaluation corpus used by the French ISTEX project. This corpus comprises 9779 records related to research conducted in the general field of Gerontology/Geriatrics between 1996 and 2010. The analysis steps described in Section 3 divide this corpus into three time periods.
First of all, the tool makes it possible to study the keywords that are particularly characteristic of the topics and to see the evolution of their importance over time. Figure 15 shows, for example, the main keywords and their evolution in the above-mentioned corpus. The size of a circle is proportional to the importance of the related keyword in its topic. This size is therefore determined by the contrast value produced by the feature maximization process (Section 3.2). Figure 16 shows the description of a topic (cluster) through its representative keywords. The size of the rectangles is also proportional to the contrast value.
Taking some distance from the corpus, Figure 17 allows the topics of a period to be observed in a global way. In this figure, each column represents a topic, and the cells in the column are the keywords that describe this topic. The size of these cells is determined by the relative importance of the related keywords, in terms of contrast, in the topic. The colors represent the intensity of the contrast of the considered keywords.
Figure 18 allows us to take some more distance and to consider the interactions between pairs of periods. This visualization provides a detailed representation of the diachronic analysis between two periods. The blue rectangle represents the period prior to that represented by the yellow rectangle. The circles represent topics, and their size is determined by the number of documents they contain. The label of each circle is currently the most characteristic keyword (i.e. the most contrasted one) of the period (for more details, it is possible to select several labels). In addition, the links between circles represent inter-period topic links. They are characterized by a force, which depends on both the number of keywords shared between the topics of the two periods and the contrast of those shared keywords. This force determines the thickness of the link. The yellow rectangle below details the similarities for each link between two topics. The dissimilarities between topics can be displayed as well, but they are not shown in the figure.
Finally, Figure 19 gives an overview of all the topic links between periods within the corpus. Each color represents a period, and the vertical rectangles represent the topics of this period that have links with other periods. The inter-period links between topics appear in grey. Dotted links indicate specific types of topic connections: one of the two topics has a broader descriptive vocabulary. It is possible to see the details of a topic by passing the mouse over the colored rectangles. Similarly, details on topic links are available on the grey areas.
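To make the notion of link force between topics of two periods more concrete, here is a minimal Python sketch. The exact formula used by Diachronic'Explorer is not given in this text, so everything below is an assumption made for illustration: each topic is represented as a dictionary mapping keywords to contrast values, and the strength is a contrast-weighted overlap averaged over both directions, normalized so that identical topics obtain a value of 1. The function names and the keyword values are hypothetical.

```python
def directed_match(source, target):
    """Share of the target's total contrast carried by keywords
    that are also present in the source topic (assumed measure)."""
    total = sum(target.values())
    if total == 0:
        return 0.0
    shared = sum(weight for keyword, weight in target.items() if keyword in source)
    return shared / total


def link_strength(topic_a, topic_b):
    """Symmetric strength: average of the two directed matches (assumption)."""
    return (directed_match(topic_a, topic_b) + directed_match(topic_b, topic_a)) / 2


# Hypothetical topics of two consecutive periods: keyword -> contrast.
topic_period_1 = {"nursing": 2.0, "care": 1.0, "dementia": 1.0}
topic_period_2 = {"nursing": 3.0, "home": 1.0}

strength = link_strength(topic_period_1, topic_period_2)
print(f"link strength: {strength:.3f}")
```

In this sketch, both the number of shared keywords and their contrast increase the strength, which matches the qualitative description above. The two directed values can differ when one topic has a broader vocabulary than the other, but whether this is how the tool identifies such cases is not stated in the text.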
To see the corpus content in its entirety, Diachronic'Explorer also offers an original visualization method that shows the information in the form of a contrast graph (Fig. 20). The big circles represent topics and the small ones represent keywords. If a topic is described by a keyword, a link is present between the two circles. This visualization thus shows all the information in a condensed manner.

Conclusion
In this paper, using a detailed state of the art of the work done on the analysis of textual data, and particularly of scientific data, we first highlighted the strategic importance of the diachronic analysis of such data, as well as the difficulties and complexities related to this type of processing, whether it is the lack of available metadata or the parameter-setting and scope problems of the usual analysis methods, especially those of topic detection. We also showed that there are many interesting alternatives for the visualization of analysis results. This discussion enabled us to propose, as a second step, a new analysis methodology based on feature maximization. This methodology has many benefits over existing ones: it is parameter-free, it is applicable at different scales, from the corpus to the document, it is easy to compute and, finally, it has strong synthesis capabilities, which make its results easily interpretable, even for complex problems and large corpora. All these decisive advantages have in particular helped us to create the Diachronic'Explorer tool, which, by integrating unsupervised Bayesian reasoning and feature maximization with many visualization methods, provides effective solutions to a problem as complex as the detection of changes, from the full text, within a large corpus of scientific publications whose topics evolve with time. This indirectly shows that this type of methodology, due to its great flexibility, has a much wider field of application than that presented in this paper. Its synthesis capabilities make it indispensable, especially upstream of visualization methods, when the representation of the data itself is complex.

Fig. 3. Sample data for supervised classification in Men/Women classes.

Fig. 4. Sample data and computation of feature F-measure for the Shoes Size feature.

Fig. 5. Feature F-measure of the feature and related marginal average.

Fig. 6. Principle of computation of contrast on selected features and obtained results.

Fig. 8. First page of the selected paper.

Fig. 9. Graphical representation of the content of a scientific paper with the use of its structure.

Fig. 10. Distribution of the weights of the sentences.

Fig. 11. Automatic summary of the paper of Figure 8 produced by sentence extraction (most relevant sentences along with their weights ranked in their order of appearance in the text).

Fig. 15. Evolution of the importance of the keywords through the course of the 3 time periods for the studied corpus. The size of the circles materializes this importance. Contrast values can also be displayed (here for the keyword nursing).

Fig. 16. List of keywords associated with a cluster and ranked by contrast values. Colors materialize the different values.

Fig. 17. Topics of a period described by their associated keywords. Each topic is a column and the height of a cell materializes the importance of the keyword related to this cell. The color represents the intensity of its contrast.

Fig. 18. Diachronic visualization of 2 periods (extract). Circles materialize topics and links between them indicate similarities or slight changes. The strength of a link is materialized by its thickness and by an indicator value (a value of 1 corresponding to a perfect match). The yellow frame (bottom) details the similarity between Cluster 8 of the blue period and Cluster 3 of the yellow period.

Fig. 19. Diachronic visualization of all the periods of a corpus. Each color represents one period and vertical rectangles materialize topics. In grey, one can observe the links between the topics of the different periods.