Parallelising the Computation of Minimal Absent Words

An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix arrays (Barton et al., 2014). An implementation of this algorithm was also provided by the authors and is currently the fastest available. In this article, we present a new O(n)-time and O(n)-space algorithm for computing all minimal absent words; it has the desirable property that, given the indexing data structure at hand, the computation of minimal absent words can be executed in parallel. Experimental results show that a multiprocessing implementation of this algorithm can accelerate the overall computation by more than a factor of two compared to state-of-the-art approaches. By excluding the indexing data structure construction time, we show that the implementation achieves near-optimal speed-ups.


Introduction
Sequence comparison is an important step in many tasks in bioinformatics. It is fundamental in many applications, from the reconstruction of phylogenies to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on notions of distance or similarity between sequences, which are generally computed through sequence alignment techniques. An issue with using alignment techniques is that they are computationally expensive, requiring time quadratic in the length of the sequences; a truly sub-quadratic algorithm for this problem seems highly unlikely [1]. This has led to increased research into alignment-free techniques [10].
Whole-genome alignments prove computationally intensive and have little biological significance. Hence, standard notions for sequence comparison are gradually being complemented, and in some cases replaced, by alternative ones that refer either implicitly or explicitly to the composition of sequences in terms of their constituent patterns. One such notion is based on comparing the words that are absent in each sequence. A word is an absent word of some sequence if it does not occur in the sequence. Absent words represent a type of negative information: information about what does not occur in the sequence. For instance, considering the words which occur in one sequence but not in another can be used to detect mutations or other biologically significant events [17].
Given a sequence of length n, the number of absent words of length at most n is exponential in n. However, the number of certain classes of absent words is only linear in n. A minimal absent word of a sequence is an absent word whose proper factors all occur in the sequence. Notice that minimal and shortest absent words [18] are not the same; minimal absent words are a superset of shortest absent words [15]. An upper bound on the number of minimal absent words is known to be O(σn) [6,13], where σ is the size of the alphabet. This suggests that it may be possible to compare sequences in time proportional to their lengths, for a fixed-sized alphabet, instead of proportional to the product of their lengths [10].
Recently, there has been a number of studies on the biological significance of absent words in various species. The most comprehensive study on the significance of absent words is probably [2]; in this, the authors suggest that the deficit of certain subsets of absent words in vertebrates may be explained by the hypermutability of the genome. It was later found in [9] that the compositional biases observed in vertebrates in [2] are not uniform throughout different sets of minimal absent words. Moreover, the analyses in [9] support the hypothesis that minimal absent words are inherited through a common ancestor, in addition to lineage-specific inheritance, only in vertebrates. In [8], the minimal absent words in four human genomes were computed, and it was shown that, as expected, intra-species variations in minimal absent words were lower than inter-species variations. Very recently, in [17], it was shown that there exist three minimal absent words in the Ebola virus genomes which are absent from the human genome. The authors suggest that the identification of such species-specific sequences may prove to be useful for the development of both diagnosis and therapeutics.
From an algorithmic perspective, an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata was presented in [6]. An alternative O(n)-time solution for finding minimal absent words of length at most ℓ, such that ℓ = O(1), based on the construction of tries of bounded-length factors was presented in [5]. A drawback of these approaches, in practical terms, is that the construction of suffix automata (or of tries) often has a large memory footprint. Hence, an important problem was to be able to compute minimal absent words with more memory-efficient data structures (cf. [4]).
The computation of minimal absent words based on the construction of suffix arrays was considered in [15]; although this algorithm has linear-time performance in practice, its worst-case time complexity is O(n²). The first O(n)-time and O(n)-space suffix-array-based algorithm was recently presented in [3] to bridge this unpleasant gap. An implementation of this algorithm is currently, and to the best of our knowledge, the fastest available for the computation of minimal absent words. With the continuous efforts in whole-genome sequencing, the computation of minimal absent words remains the main bottleneck in analysing a large set of large genomes [8,9,17]. Hence, due to the large amounts of data being produced, it is desirable to further engineer this computation.
Our Contribution. In this article, our contribution is threefold: (a) We present a new O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet; (b) We show that this algorithm has the desirable property that, given the relevant indexing data structure at hand, the computation of minimal absent words can be executed in parallel; and (c) We make available an implementation of this algorithm for shared-memory multiprocessing programming. Experimental results, using real and synthetic data, show that the overall computation is accelerated by more than a factor of two compared to the state of the art. By excluding the indexing data structure construction time, we show that the implementation achieves near-optimal speed-ups. This is important as engineering further the involved indexing data structure construction is an ongoing research topic [16], which is beyond the scope of this article.

Definitions and Notation
To provide an overview of our result and algorithm, we begin with a few definitions from [3]. Let y = y[0]y[1] . . y[n − 1] be a word of length n = |y| over a finite ordered alphabet Σ of size σ = |Σ| = O(1). We denote by y[i . . j] the factor of y that starts at position i and ends at position j, and by ε the empty word, that is, the word of length 0. We recall that a prefix of y is a factor that starts at position 0 (y[0 . . j]) and a suffix is a factor that ends at position n − 1 (y[i . . n − 1]), and that a factor of y is a proper factor if it is not the empty word or y itself. Let x be a word of length m, 0 < m ≤ n. We say that there exists an occurrence of x in y, or, more simply, that x occurs in y, when x is a factor of y. Every occurrence of x can be characterised by a starting position in y. Thus we say that x occurs at the starting position i in y when x = y[i . . i + m − 1]. Conversely, we say that the word x is an absent word of y if it does not occur in y. An absent word x of y, with m ≥ 2, is minimal if and only if all its proper factors occur in y.
We denote by SA the suffix array of y, that is, the array of length n of the starting positions of all sorted suffixes of y, i.e. for all 1 ≤ r < n, we have y[SA[r − 1] . . n − 1] < y[SA[r] . . n − 1]. We denote by lcp(r, s) the length of the longest common prefix of the words y[SA[r] . . n − 1] and y[SA[s] . . n − 1], for all 0 ≤ r, s < n with r ≠ s, and 0 otherwise. We denote by LCP the longest common prefix array of y defined by LCP[r] = lcp(r − 1, r), for all 1 ≤ r < n, and LCP[0] = 0. SA [14] and LCP [7] of y can be computed in time and space O(n).
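To make the definitions of SA and LCP concrete, the following sketch computes both arrays naively in quadratic time (the function names are ours; the algorithm itself relies on the O(n)-time constructions of [14] and [7] instead):

```python
def suffix_array(y):
    """Starting positions of the lexicographically sorted suffixes of y."""
    return sorted(range(len(y)), key=lambda i: y[i:])

def lcp_array(y, sa):
    """LCP[r] = length of the longest common prefix of the suffixes
    ranked r - 1 and r; LCP[0] = 0 by convention."""
    lcp = [0] * len(y)
    for r in range(1, len(y)):
        a, b = y[sa[r - 1]:], y[sa[r]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[r] = k
    return lcp

# For y = "banana": SA = [5, 3, 1, 0, 4, 2] and LCP = [0, 1, 3, 0, 0, 2].
```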
In this article, we consider the following problem.

MinimalAbsentWords
Input: a word y on Σ of length n
Output: all tuples ⟨a, (i, j)⟩, such that the word x, defined by x[0] = a and x[1 . . m − 1] = y[i . . j], is a minimal absent word of y

Algorithm pMAW

In this section, we present algorithm pMAW, a new O(n)-time and O(n)-space algorithm for computing all minimal absent words of a word of length n using arrays SA and LCP. We first start by explaining some useful properties from [15] that we use in algorithm pMAW. Then we present our algorithm in detail, and, finally, we show how it can be adapted for parallel computing.
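As a concrete reference for the MinimalAbsentWords problem, the brute-force sketch below (ours, exponential in the worst case and useful only for checking small examples) enumerates candidate words and applies the minimality criterion directly: x is a minimal absent word of y iff x is absent from y while both its longest proper prefix and longest proper suffix occur in y.

```python
from itertools import product

def minimal_absent_words(y, alphabet):
    """Brute-force set of all minimal absent words of y over alphabet."""
    # All factors of y (a minimal absent word has length at least 2).
    factors = {y[i:j] for i in range(len(y)) for j in range(i, len(y) + 1)}
    maws = set()
    for m in range(2, len(y) + 2):  # a maw has length at most |y| + 1
        for letters in product(alphabet, repeat=m):
            x = "".join(letters)
            if x not in factors and x[:-1] in factors and x[1:] in factors:
                maws.add(x)
    return maws

# minimal_absent_words("abaab", "ab") → {"bb", "bab", "aaa", "aaba"}
```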

Useful Properties
A word x is a minimal absent word of y if it is an absent word whose proper factors all occur in y; equivalently, x is absent from y while both the longest proper prefix and the longest proper suffix of x occur in y.

Definition 1. A repeated pair in a word y is a tuple ⟨i, j, w⟩ such that the word w occurs in y at starting positions i and j, with i ≠ j. A repeated pair is right (resp. left) maximal if y[i + |w|] ≠ y[j + |w|] (resp. y[i − 1] ≠ y[j − 1]); it is maximal if it is both left and right maximal.

Lemma 2 ([15]). If awb is a minimal absent word of a word y, where a and b are letters and w a word, then there exist two positions i and j such that ⟨i, j, w⟩ is a maximal repeated pair in y.
By Lemma 2, we can exhaustively compute minimal absent words by examining all the maximal repeated pairs. To compute maximal repeated pairs, we consider all right maximal repeated pairs and check the letters that occur just before.
Definition 3. Given the LCP array of a word of length n, we say that the interval [i, j], with 0 ≤ i < j < n, is an LCP-interval of LCP-depth d if LCP[r] ≥ d for all i < r ≤ j, with LCP[r] = d for at least one such r, and LCP[i] < d and (j = n − 1 or LCP[j + 1] < d).

Right maximal repeated pairs are given by the suffix array with the notion of LCP-interval. Indeed, if positions i and j are in an LCP-interval of LCP-depth d, then ⟨SA[i], SA[j], y[SA[i] . . SA[i] + d − 1]⟩ is a right maximal repeated pair. Analogously, if ⟨i, j, w⟩ is a right maximal repeated pair, then i and j are in the same LCP-interval of LCP-depth |w|.
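The LCP-intervals of Definition 3 can be enumerated with the classic stack-based, bottom-up pass over the LCP array known from the suffix-array literature; the sketch below is illustrative and is not the paper's own routine.

```python
def lcp_intervals(lcp):
    """Enumerate all LCP-intervals as (depth, lb, rb) tuples via a
    bottom-up, stack-based pass over the LCP array."""
    n = len(lcp)
    out = []
    stack = [(0, 0)]  # (LCP-depth, left boundary)
    for i in range(1, n):
        lb = i - 1
        while stack and lcp[i] < stack[-1][0]:
            depth, lb = stack.pop()
            out.append((depth, lb, i - 1))  # interval fully visited
        if not stack or lcp[i] > stack[-1][0]:
            stack.append((lcp[i], lb))
    while stack:  # close the intervals reaching the end of the array
        depth, lb = stack.pop()
        out.append((depth, lb, n - 1))
    return out

# For y = "banana" (LCP = [0, 1, 3, 0, 0, 2]):
# lcp_intervals([0, 1, 3, 0, 0, 2])
#   → [(3, 1, 2), (1, 0, 2), (2, 4, 5), (0, 0, 5)]
# e.g. (3, 1, 2): ranks 1..2 share "ana"; (2, 4, 5): ranks 4..5 share "na".
```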

Computation of Minimal Absent Words
For the rest of this section we denote minimal absent words by maws. We first pre-compute SA, LCP, and a bit-vector v such that v[i] = 1 if and only if position i is a local maximum of the LCP array. If i is a local maximum in the LCP array, then [i − 1, i] is the LCP-interval of LCP-depth LCP[i] that contains i. Consequently, our idea is to start the computation at the first local maximum of the LCP array and to visit the surrounding positions in decreasing order of their LCP value. In this process we keep in the array SetLetter the set of letters that occur before the repeated factor. When we reach a local minimum we store its position in the SA array on the stack LifoPos, and the current array SetLetter on the stack LifoSet. We analyse them once we have visited their whole LCP-interval. In this way, we consider each maximal repeated pair and infer from them the whole set of maws using Lemma 2. An example of this function is illustrated in Fig. 1. Contrary to MAW [3], the previous linear-time algorithm, in pMAW we do not consider our data structures globally; we rather consider each LCP-interval independently. This important property will allow us to use parallel computations, as shown in Section 3.3.
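As an illustration of the pre-computation step, the sketch below builds one plausible version of the bit-vector v marking local maxima of the LCP array; the exact tie-breaking convention for plateaus and the array boundary is ours, not the paper's.

```python
def local_maxima_bitvector(lcp):
    """v[i] = 1 iff LCP[i] is a local maximum: at least as large as its
    left neighbour and strictly larger than its right one (our convention
    for plateaus and the right boundary of the array)."""
    n = len(lcp)
    v = [0] * n
    for i in range(1, n):
        if lcp[i] == 0:
            continue  # a depth-0 position opens no interval of interest
        left_ok = lcp[i] >= lcp[i - 1]
        right_ok = (i == n - 1) or lcp[i] > lcp[i + 1]
        if left_ok and right_ok:
            v[i] = 1
    return v

# For LCP = [0, 1, 3, 0, 0, 2] (y = "banana"): v = [0, 0, 1, 0, 0, 1];
# position 2 yields the LCP-interval [1, 2] of LCP-depth 3 ("ana").
```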
Overall Complexity. We use arrays SA and LCP, which can be computed in time and space O(n) [14,7]. For the bit-vector v, there exists a representation which uses n + o(n) bits of storage space and supports rank and select on a bit-vector of size n in constant time [11]. We also use two stacks, LifoPos and LifoSet, in which we push and pop O(n) elements, each containing at most σ integers. Thus the whole algorithm requires time and space O(σn), that is, O(n) for a fixed-sized alphabet. The advantages of pMAW over existing works are as follows. It is (provably) linear-time in the worst case, as opposed to the algorithm in [15]. Contrary to the linear-time algorithm in [3], we explicitly compute the LCP-intervals; for a given depth, LCP-intervals do not overlap, and therefore we can consider them independently.

Parallelisation Scheme
Lemma 5. Let y be a word of length n over an alphabet of size σ and let ℓ be the length of the shortest minimal absent word of y. Then ℓ − 1 = min{k ≥ 0 : |{s ∈ [0, n − 1] : LCP[s] = k}| < (σ − 1)σ^k + 1}.

We bound the number of positions s with LCP[s] = k, that is, the number of pairs (s_i, s_{i+1}) of consecutive positions in SA whose suffixes share a common prefix of length exactly k. For a fixed word w of length k, the suffixes of y starting with w occupy a contiguous interval of SA; consecutive positions in this interval have an LCP value equal to k when the letters following w differ, and, as the alphabet is of size σ, this can happen at most σ − 1 times. There are σ^k different words of length k. In addition, we have one sub-case, for the unique suffix of length exactly k, when position SA[s_i] + k falls outside y, so that y[SA[s_i] + k] is not a letter of the alphabet Σ; this gives one more position with an LCP value equal to k. Thus, there are at most (σ − 1)σ^k pairs (s_i, s_{i+1}), so there are at most (σ − 1)σ^k + 1 positions with an LCP value equal to k. The equality holds if and only if all the words of length k + 1 occur in y, so only if k < ℓ' − 1, where ℓ' is the length of the shortest absent word. A minimal absent word is an absent word, so ℓ' ≤ ℓ. Conversely, let x be a shortest absent word; then all its proper factors occur in y because they are shorter than x, so x is a minimal absent word, and ℓ ≤ ℓ'. Therefore ℓ = ℓ', and the equality holds if and only if k ∈ [0, ℓ − 2]. By Lemma 5, the length ℓ of the shortest minimal absent word of some word of length n satisfies ℓ − 1 = min{k ≥ 0 : |{s ∈ [0, n − 1] : LCP[s] = k}| < (σ − 1)σ^k + 1}. As the alphabet is of size σ, there are σ^k distinct words of length k, but a word y of length n has only n + 1 − k factors of length k. Thus, if σ^k > n + 1 − k, there are absent words of length k in y. Consequently, we have ℓ ≤ log_σ(n + 1 − ℓ) < log_σ(n). Thus, we compute ℓ, the length of the shortest minimal absent word, in one pass over the LCP array by counting the number of positions having an LCP value equal to d, for all d ∈ [0, ⌈log_σ(n)⌉].
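The one-pass computation of ℓ described above can be sketched as follows (illustrative; the function name is ours):

```python
import math

def shortest_maw_length(lcp, sigma):
    """Length ell of the shortest minimal absent word, via Lemma 5:
    ell - 1 = min{ k >= 0 : #{s : LCP[s] = k} < (sigma - 1) * sigma**k + 1 }.
    One pass over LCP, counting values d in [0, ceil(log_sigma(n))]."""
    n = len(lcp)
    limit = max(1, math.ceil(math.log(n, sigma)))
    count = [0] * (limit + 1)
    for value in lcp:
        if value <= limit:
            count[value] += 1
    for k in range(limit + 1):
        if count[k] < (sigma - 1) * sigma ** k + 1:
            return k + 1  # ell = k + 1
    return limit + 1  # not reached: by Lemma 5, k <= ceil(log_sigma(n))

# For y = "banana" (sigma = 3, LCP = [0, 1, 3, 0, 0, 2]):
# shortest_maw_length([0, 1, 3, 0, 0, 2], 3) → 2,
# and indeed e.g. "aa" is a shortest minimal absent word of "banana".
```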
According to Lemma 2, we can ignore positions having an LCP value lower than ℓ − 2 when computing minimal absent words. Hence, we focus on LCP-intervals of LCP-depth greater than or equal to ℓ − 2: they are sufficient to exhaustively compute the set of minimal absent words. Consequently, we compute a partition 0 = k_0 < k_1 < . . . < k_t = n of the positions of the LCP array. This partition is such that every LCP-interval of LCP-depth greater than or equal to ℓ − 2 is entirely included in one of the sub-intervals [k_i, k_{i+1}).
Therefore we can consider each one of these sub-intervals independently, and thus parallelise the computation of minimal absent words. In each sub-interval we go through the SA and LCP arrays starting at the first (from left to right) local maximum and going down until we reach a local minimum, as described in Section 3.2. For an overview of the algorithm pMAW, see Fig. 2.
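The partitioning step can be sketched as follows: a position whose LCP value is below ℓ − 2 cannot lie strictly inside an LCP-interval of LCP-depth at least ℓ − 2, so such positions are safe cut points, and the resulting sub-intervals can be handed to independent workers. This sketch (names ours) only computes the cut points, standing in for the OpenMP partitioning of the actual implementation.

```python
def partition_points(lcp, ell):
    """Cut points 0 = k_0 < k_1 < ... < k_t = n such that every
    LCP-interval of LCP-depth >= ell - 2 lies within one sub-interval
    [k_i, k_{i+1}): we cut exactly at positions with LCP value < ell - 2."""
    n = len(lcp)
    cuts = [0]
    for s in range(1, n):
        if lcp[s] < ell - 2:
            cuts.append(s)  # safe cut: s is interior to no relevant interval
    cuts.append(n)
    return cuts

# For LCP = [0, 1, 3, 0, 0, 2] and ell = 3 (cut where LCP < 1):
# partition_points([0, 1, 3, 0, 0, 2], 3) → [0, 3, 4, 6]
# The LCP-intervals of depth >= 1 ([0, 2] and [4, 5]) each fall in one part.
```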

Experimental Results
We implemented algorithm pMAW as a programme to compute all minimal absent words of a given sequence. The programme was implemented in the C programming language, using the Open Multi-Processing (OpenMP) API for shared-memory multiprocessing programming, and developed under the GNU/Linux operating system. It takes as input arguments a file in (Multi)FASTA format and the minimal and maximal length of the minimal absent words to be output; it then produces a file with all minimal absent words of length within this range as output. There are additional input parameters; for example, the number t of available processing elements. The implementation is distributed under the GNU General Public License (GPL), and it is available at http://github.com/solonas13/maw, which is set up for maintaining the source code and the man-page documentation. The experiments were conducted on a Desktop PC using 1 to 16 cores of 2 Intel Xeon E5-2670V2 Ten-Core CPUs at 2.50GHz and 256GB of main memory under 64-bit GNU/Linux.
To evaluate the efficiency of our implementation, we compared it against the corresponding performance of MAW [3], which is currently the fastest available implementation for computing minimal absent words. We generated three random sequences of length 10Mbp, 100Mbp, and 1Gbp, respectively, by using a uniform frequency distribution of letters of the DNA alphabet. We computed all minimal absent words of length at most 20 for each sequence. We considered both the 5′ → 3′ and the 3′ → 5′ DNA strands. Fig. 3a depicts elapsed-time comparisons of pMAW and MAW, including the sequential part of the algorithm. pMAW becomes the fastest in all cases when t ≥ 2, accelerating the computation by more than a factor of two when t = 16. Notice that the y-axis is on a logarithmic scale. The measured relative speed-up of pMAW is illustrated in Fig. 3b. The relative speed-up was calculated as the ratio of the runtime of pMAW on 1 core to the runtime of pMAW on t cores, excluding the sequential part of the algorithm. The results highlight the excellent scalability of pMAW when the letters have a uniform frequency distribution in the sequence. In this case, pMAW achieves near-optimal speed-ups, confirming our theoretical findings.

Fig. 4: Elapsed-time comparison of pMAW and MAW and relative speed-up of pMAW for computing minimal absent words using real DNA sequences
To further evaluate the efficiency of our implementation, we compared it against the corresponding performance of MAW using real data. We considered the genomes of Homo sapiens and Mus musculus, obtained from the NCBI database (ftp://ftp.ncbi.nih.gov/genomes/). We computed all minimal absent words of length at most 20 of the complete sequence of the Homo sapiens (2,937,639,113 bp) and Mus musculus (2,647,521,431 bp) genomes, ignoring unknown bases. We considered both the 5′ → 3′ and the 3′ → 5′ DNA strands. Fig. 4a depicts elapsed-time comparisons of pMAW and MAW, including the sequential part of the algorithm. pMAW becomes the fastest in all cases when t ≥ 2, accelerating the computation by more than a factor of two when t = 16. Notice that the y-axis is on a logarithmic scale. The measured relative speed-up of pMAW is illustrated in Fig. 4b. The relative speed-up was calculated as the ratio of the runtime of pMAW on 1 core to the runtime of pMAW on t cores, excluding the sequential part of the algorithm. The results highlight the good scalability of pMAW with real data. The computation is accelerated by a factor of 10 when t = 16. The maximum allocated memory was 137GB for both programmes.

Final Remarks
The importance of our contribution here is underlined by the fact that any parallel algorithms for the construction of the involved indexing data structure can be used directly to replace the sequential part of the algorithm proposed here (see Fig. 2). This would result in a fully parallel algorithm for the computation of minimal absent words. Our immediate target is to investigate the performance of such an algorithm by using the parallel algorithms presented in [16] for constructing the suffix array and the longest common prefix array.