Energy and Expenditure Aware Data Replication Strategy


Abstract-Energy saving is a major challenge for Information Technology (IT) companies that aim to reduce their carbon footprint while providing large-scale cloud services. These companies often rely on data replication to satisfy tenants' objectives, e.g., performance, especially with the increasing volume of data distributed throughout the world. In this paper, we propose a static and multi-objective data replication strategy (E2ARS) that aims to reduce both the energy consumption and the expenditure of the provider. E2ARS leverages cloud heterogeneity and energy-efficient technologies. We first compare different policies of our strategy, ranging from taking only energy consumption into account to taking only expenditure into account. Unsurprisingly, the more a policy favors energy reduction, the fewer replicas it creates. We then compare E2ARS with strategies from the literature: E2ARS reduces both energy consumption and expenditure, whereas those strategies satisfy only one of the two objectives.
Index Terms-Cloud, Data replication, Provider expenditure, Energy consumption, SLA violation

I. INTRODUCTION
The volume of data produced by humanity has increased sharply in recent years, and this data is heavily accessed. One way to cope with this is data replication, a well-studied technique that aims to satisfy objectives such as availability and performance. In the Cloud, replication also helps reduce energy consumption and costs, as it is a good lever to adapt resources on demand.
In a Cloud context, elasticity is a way to limit over-provisioning by adapting resources automatically. This implies a specific economic model, called the pay-as-you-go model, where tenants only pay for what they consume. The price to rent resources is specified in a Service Level Agreement (SLA), which also contains Service Level Objectives (SLOs) the provider has to respect. If the provider does not, penalties are applied, mainly as a refund of the rent to the tenant.
Reducing the carbon footprint is an issue companies have to consider, whether because of their values (e.g., aiming at carbon neutrality) or to attract customers who are aware of environmental issues. To fulfill this objective, many techniques have been studied, such as sleep states and consolidation, which put unused resources into sleep mode and avoid over-provisioning.
In this paper, we propose a static data replication strategy that aims to reduce both expenditure and energy consumption while considering performance for tenants. Our strategy is based on an optimization algorithm that uses the energy consumption and expenditure models we introduce. It leverages heterogeneity, sleep states and consolidation. Note that we consider read-only data that will not be updated.
The rest of this paper is organized as follows: Section II presents a state of the art of data replication strategies that take into account the energy consumption or the expenditure of the provider. Section III describes our data replication strategy. Section IV validates the proposed strategy through the analysis of experimental results, comparing it with proposals from the literature in terms of energy consumption, number of violations and expenditure. Finally, we conclude and outline future work.

II. STATE OF THE ART
We focus on strategies that take into account the provider's expenditure, its energy consumption, or both. In [1], replication is triggered by an SLO violation and applied only if it is profitable for the provider. [2] proposed a resilient and cost-effective data replication strategy that classifies data into 2 groups based on their popularity, placing and compressing each data item on either primary or backup servers. [3] relies on a workflow application model specifying which task needs which data, which makes it possible to replicate data only when they are needed and only if it reduces costs compared to doing nothing. [4] proposed a strategy that estimates, for each time frame, the benefit of creating a new replica or migrating data.
On the energy consumption side, [5] proposed a static data replication strategy that addresses availability, service time, load balancing, energy consumption and latency through an evolutionary algorithm where each objective is weighted. [6] considers a 3-tier fat-tree data center architecture and aims to reduce energy and bandwidth consumption by replicating at the lower levels of the hierarchy. [7] classifies data according to their popularity and servers by their power consumption, replicating hot (popular) data on hot servers (that consume more) and vice versa.
Finally, few strategies consider both objectives. [8] aims to reduce energy consumption in order to maximize the provider's profit; however, this strategy does not model all the costs, e.g., the cost of transferring files. [9] considers a linear combination of energy consumption and expenditure and then uses an optimization algorithm to reduce this objective function, but it does not let the administrator choose a policy with a clear view of the trade-off. Compared to the literature, our strategy aims to reduce both the energy consumption and the costs to store, read and replicate data.

III. ENERGY AND EXPENDITURE AWARE DATA REPLICATION STRATEGY

A. Notation
Let F be a set of files stored on a set of nodes N. Each file f_i ∈ F, 1 ≤ i ≤ z, has a size s(f_i). Each node n_j ∈ N, 1 ≤ j ≤ m, has a storage capacity cp_j. Let φ be a matrix of size (z, m) that denotes the placement of the files on the nodes: φ(f_i, n_j) is equal to 1 iff f_i is on n_j, and 0 otherwise. E_a denotes the energy consumed to perform the action a, a ∈ A. Pr_b represents the price of a node action b, b ∈ B. The sets A and B depend on the state of the node. The static state represents the node when it is up but inactive (static). The dynamic states contain actions such as reading a file from a node (read), writing a file to a node (write), transferring data between nodes through the network (network) and storing a file on a node (store). A contains the read, write and network actions and the static state. B contains only the store and network actions.
t is the duration for which the user will rent the nodes, and nbRds is the number of reads the user will perform. Table I summarizes the notations.
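The notation above can be sketched in code. The following is a minimal illustration (all names are ours, not the paper's) of the placement matrix φ and the capacity bookkeeping it enables:

```python
# Hypothetical sketch of the paper's notation: z files, m nodes, and a
# z-by-m placement matrix phi where phi[i][j] == 1 iff file f_i is
# stored on node n_j.
from dataclasses import dataclass

@dataclass
class File:
    size: float          # s(f_i), in megabytes

@dataclass
class Node:
    capacity: float      # cp_j, in megabytes

def used_capacity(phi, files, j):
    """Total size of the files placed on node n_j under placement phi."""
    return sum(f.size for i, f in enumerate(files) if phi[i][j] == 1)

files = [File(100.0), File(250.0)]   # z = 2
nodes = [Node(500.0), Node(500.0)]   # m = 2
phi = [[1, 0],                       # f_1 on n_1 only
       [1, 1]]                       # f_2 replicated on both nodes
print(used_capacity(phi, files, 0))  # 350.0
```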

B. Models and Objective functions
In this part, we propose the models used in the optimization algorithm to estimate the provider's expenditure and energy consumption. These models cover 3 kinds of events: replicating, storing and reading. The expenditure model used in the following estimations is based on [1]. To estimate the energy consumption, we modeled each component involved in the data management of a node. Disk energy consumption is modeled based on [10], also used in [11] to model Solid-State Drives. Memory energy consumption is modeled based on [12], regularly updated by technical reports [13]. Finally, network energy consumption is considered through the Network Interface Card of the nodes, based on [14], and the switches, based on [15]. In these models, processors are only considered through their idle energy consumption. It is worth noting that these models are only used to build the following strategy and are not meant to precisely describe reality: they can be modified without impacting the core of the proposed strategy.
1) Replication: First, the impact of replicating data from their original node to a chosen node is modeled. Let N_o be the set of nodes, included in N, storing the original files before the placement strategy occurs. n_io represents the original node where f_i is stored.
If the chosen node is the original node, no replication occurs and the energy consumption and cost for replication are equal to 0. Otherwise, the energy consumption combines the transfer and the writing energy consumption, as given in Eq. 1:

E_repl(f_i, n_j) = E_network(n_io, n_j, f_i) + E_write(n_j, f_i)    (1)

Replicating a file also implies a cost for the provider, mostly the communication cost through the network. This cost is based on the size of the file s(f_i) and on Pr_network(n_io, n_j), the price per megabyte, which depends on the placement: if the nodes are in the same data center, it is cheaper than if they are in 2 different regions. The expenditure model to replicate files is given in Eq. 2:

Pr_repl(f_i, n_j) = s(f_i) × Pr_network(n_io, n_j)    (2)

2) Storage: After writing all replicas, the impact of storing files for a duration t is modeled. We suppose that: (i) nodes cannot be turned off if they store data, in order to retrieve data as quickly as possible; (ii) empty nodes can be turned off and have a power consumption equal to 0; (iii) adding a file to a node that already stores data might slightly increase its power consumption, but much less than waking up a node to store this file, hence we only consider the static energy consumption for storing files.
E_static(n_j, t) is the static energy consumption to keep n_j up for a duration t. It is estimated by summing the static power consumption P_static(c) of each component c of the node, multiplied by t, as described in Eq. 3:

E_static(n_j, t) = Σ_{c ∈ components(n_j)} P_static(c) × t    (3)

On the other hand, storing files implies a cost for the provider. It depends on the size of the file s(f_i) and on Pr_store(n_j), the price per second and per megabyte to store data on n_j, over the duration t, as given in Eq. 4:

Pr_storage(f_i, n_j, t) = s(f_i) × Pr_store(n_j) × t    (4)

3) Read: Finally, the energy consumption and costs linked to the number of reads are modeled.
Let N_i be the set of nodes that store f_i and N̄_i the set of nodes that do not. These sets are built based on φ. Nodes in N_i are denoted n_j, 1 ≤ j ≤ m_i, with m_i being the number of replicas of f_i. Nodes in N̄_i are denoted n̄_j, 1 ≤ j ≤ m̄_i. Let shortest(n_1, n_2) be equal to 1 if node n_2 is the closest to node n_1 in terms of transfer time, and 0 otherwise.
The energy consumption given in Eq. 5 includes the consumption to read a file f_i from the closest node n_j ∈ N_i to the node n̄_j that requested this file, E_read(n_j, f_i), and the energy consumed by the network to transfer this file between those nodes, E_network(n_j, n̄_j, f_i):

E_read_total(f_i) = Σ_{n̄_j ∈ N̄_i} Σ_{n_j ∈ N_i} shortest(n̄_j, n_j) × (E_read(n_j, f_i) + E_network(n_j, n̄_j, f_i))    (5)

The reading cost given in Eq. 6 is mainly the cost to transfer the file f_i from the closest node n_j ∈ N_i that stores it to the node n̄_j that requested it. This cost is based on the size of f_i and on the price per megabyte Pr_network(n_j, n̄_j) to transfer this file:

Pr_read(f_i) = Σ_{n̄_j ∈ N̄_i} Σ_{n_j ∈ N_i} shortest(n̄_j, n_j) × s(f_i) × Pr_network(n_j, n̄_j)    (6)
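The per-event models above can be sketched as plain functions. The structure of each formula follows the text, but the concrete per-megabyte rates in the usage example are toy assumptions of ours:

```python
# Hedged sketch of the per-event models (Eqs. 1-4). Only the structure
# mirrors the paper; the rate functions passed in are illustrative.
def e_replicate(e_network, e_write, src, dst, f_size):
    # Eq. 1: transfer energy plus writing energy; 0 if no replication.
    if src == dst:
        return 0.0
    return e_network(src, dst, f_size) + e_write(dst, f_size)

def pr_replicate(pr_network, src, dst, f_size):
    # Eq. 2: file size times the per-megabyte network price.
    return 0.0 if src == dst else f_size * pr_network(src, dst)

def e_store(p_static_watts, t_seconds):
    # Eq. 3: static power of each component, summed, times t.
    return sum(p_static_watts) * t_seconds

def pr_store_cost(f_size, price_per_s_per_mb, t_seconds):
    # Eq. 4: price per second and per megabyte over the duration t.
    return f_size * price_per_s_per_mb * t_seconds

# Toy usage with constant per-megabyte rates (assumed values):
e_net = lambda a, b, s: 0.5 * s    # J per MB transferred
e_wr = lambda n, s: 0.1 * s        # J per MB written
print(e_replicate(e_net, e_wr, "n1", "n2", 100.0))  # 60.0
```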

4) Multi-Objective Function:
In this part, the global models and the objective function used in the optimization algorithm are introduced. As detailed before, t represents the renting time of the user and nbRds the number of reads the user will perform. These parameters are hypotheses made to balance the number of replicas: the lower the number of reads and the higher the renting time, the lower the number of replicas should be, in order to reduce the long-term energy consumption and cost.
Based on the previously defined formulas (1), (3) and (5), the global energy consumption model is given in Eq. 7: it sums the replication energy over all placed replicas, the static energy of every active node over the duration t, and the reading energy weighted by nbRds. Similarly, the global expenditure model given in Eq. 8 combines (2), (4) and (6). With cp_j being the storage capacity of node n_j, the total size of the files stored on a node cannot exceed cp_j. This constraint is applied to individual nodes; as discussed later, it does not apply when a node is considered as the representative of a whole data center. Following [16], we choose to create a minimum of 2 replicas per file to ensure a data availability of more than 99.99% within a year.
Finally, the objective function, given in Eq. 9, consists in jointly minimizing the global energy consumption (Eq. 7) and the global expenditure (Eq. 8) under the capacity and minimum-replication constraints above.
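To make the constraints attached to the objective concrete, here is a small feasibility check over a placement φ (our own sketch, not the paper's code): per-node capacity, and at least 2 replicas per file following [16]:

```python
# Feasibility check for a placement phi: no node over capacity, and
# every file replicated at least twice. Sizes and capacities share the
# same unit (e.g. megabytes).
def feasible(phi, sizes, capacities):
    z, m = len(phi), len(phi[0])
    for j in range(m):
        if sum(sizes[i] for i in range(z) if phi[i][j]) > capacities[j]:
            return False                      # node n_j over capacity
    return all(sum(row) >= 2 for row in phi)  # >= 2 replicas per file

print(feasible([[1, 1, 0], [0, 1, 1]], [100, 200], [300, 300, 300]))  # True
```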

C. Energy and Expenditure Aware Strategy
We propose an Energy and Expenditure Aware Replication Strategy (E2ARS). This static data replication strategy can be considered as an initial data placement for a dynamic data replication strategy.
1) Overview of the static data replication algorithm: We consider a transnational cloud provider which provides services through different regions, with several data centers in each region. Data centers have different prices and parameters, and for the sake of simplicity nodes are homogeneous inside a data center. A multi-objective optimization algorithm is used to reduce both energy consumption and expenditure. However, including all nodes in the search space would make the processing very long: as the objective function considers the reading expenditure and energy consumption between every pair of nodes, restricting it to a subset of nodes greatly reduces the time to compute this part of the objective function. In order to make the algorithm run in a decent amount of time, a two-step decision process is proposed. The first optimization algorithm chooses on which data centers replicas are stored; to do so, a node is chosen to represent each data center (referred to as getRepresentatives(N) in Algorithm 1). The second optimization algorithm chooses where the data will be stored inside each chosen data center. The whole process is given in Algorithm 1.
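The two-step decision process can be outlined as follows. The helper names mirror those mentioned in the text (getRepresentatives, SPEA2, chooseIndividual, getNodesFromDC, getFileOnDC); they are passed in as functions because their implementations are not part of this sketch:

```python
# Outline of Algorithm 1's two-step decision process; the helpers are
# injected as functions, so this only shows the control flow.
def e2ars_placement(files, nodes, get_representatives, spea2,
                    choose_individual, get_nodes_from_dc,
                    get_files_on_dc, place_in_dc):
    # Step 1 (lines 1-5): choose the data centers storing each replica,
    # working on one representative node per data center.
    representatives = get_representatives(nodes)
    pareto_front = spea2(files, representatives)    # energy vs. expenditure
    dc_placement = choose_individual(pareto_front)  # administrator's policy

    # Step 2 (lines 6-10): choose concrete nodes inside each chosen DC.
    placement = {}
    for dc in dc_placement:
        dc_nodes = get_nodes_from_dc(nodes, dc)
        dc_files = get_files_on_dc(dc_placement, dc)
        placement[dc] = place_in_dc(dc_files, dc_nodes)
    return placement
```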

2) Data replication between data centers (lines 1-5):
The first step is the most important because it has to choose between highly different data centers. It has to replicate and place data where the trade-off between expenditure and energy consumption is the most balanced from the database administrator's perspective. We choose to find the Pareto front to let the administrator choose between reducing energy consumption, reducing expenditure, or making a balanced choice. The Improved Strength Pareto Evolutionary Algorithm, also known as SPEA2 [17] (referred to as SPEA2 in Algorithm 1), is efficient at finding this Pareto front: it returns a group of individuals from the Pareto front that are as far as possible from each other. The administrator can then choose the policy they want to apply (referred to as chooseIndividual in Algorithm 1). The expenditure and energy consumption models are used as objective functions in the fitness calculation. For this first optimization algorithm, we suppose that each data center can store all the data, and the constraint of creating at least 2 replicas is considered at this step. [16] highlights the fact that storing replicas on independent nodes (by reducing correlated node failures) increases data availability. In a performance context, this also reduces the bottleneck induced by an overload of reading and replication requests.
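SPEA2 itself is out of scope here, but the Pareto-dominance test the administrator's choice relies on can be sketched in a few lines (our illustration, not the paper's implementation): a candidate placement survives if no other candidate is at least as good on both objectives at once.

```python
# Minimal Pareto-front extraction over (energy, expenditure) pairs;
# lower is better on both objectives.
def pareto_front(candidates):
    """candidates: list of (energy, expenditure) tuples."""
    front = []
    for c in candidates:
        dominated = any(o[0] <= c[0] and o[1] <= c[1] and o != c
                        for o in candidates)
        if not dominated:
            front.append(c)
    return front

print(pareto_front([(10, 5), (8, 7), (12, 9)]))  # [(10, 5), (8, 7)]
```

A real SPEA2 run additionally spreads the returned individuals along this front, so the administrator sees clearly separated policies.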

3) Data placement inside chosen data centers (lines 6-10):
The second step chooses on which nodes replicas are stored inside each data center. Before running this optimization algorithm, it has to gather the nodes of the data center (referred to as getNodesFromDC in Algorithm 1) and the files that will be stored in it (getFileOnDC in Algorithm 1).
As most of the cost differences are between data centers, the placement inside a data center has little impact in terms of expenditure, unlike energy consumption, which can be reduced by leveraging sleep states. To do so, we consider a proportion p of nodes that will store data, while the other nodes are put into sleep state to reduce energy consumption. A proportion p that is too low would introduce a bottleneck in a performance context, as a high number of requests would target these nodes. Conversely, a proportion p that is too high increases the number of active nodes, and thus the energy consumption. If the size of all data to store is higher than the storage capacity of the nodes in the subset, we add the same proportion p of nodes to the subset. Based on our experiments, we chose to use 6.25% of the data center nodes, as it is a good trade-off between performance and energy consumption. This proportion depends on the data center infrastructure and has to be reevaluated in different situations.
The second optimization algorithm takes as input the set of files F, the set of files that will be stored in the chosen data center, and the set of nodes of this data center. It sorts the chosen files by size, selects the proportion of nodes that will store data, and then places the files round-robin on this subset of nodes. If a node cannot store a file due to a lack of capacity, the algorithm moves on to the next node.
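The steps above can be sketched as follows (our illustration, not the paper's implementation): sort files by size, activate a proportion p of the data center's nodes, grow the subset if it cannot hold all the data, and place files round-robin, skipping nodes that lack capacity.

```python
# Sketch of the second-step heuristic; 6.25% is the paper's chosen
# proportion, taken here as a parameter. Files that fit on no active
# node are simply left unplaced in this sketch.
def place_in_dc(file_sizes, capacities, p=0.0625):
    step = max(1, int(len(capacities) * p))
    k = step
    while k < len(capacities) and sum(file_sizes) > sum(capacities[:k]):
        k += step                        # grow the active subset by p
    k = min(k, len(capacities))
    free = list(capacities[:k])
    placement, j = {}, 0
    order = sorted(range(len(file_sizes)),
                   key=lambda i: file_sizes[i], reverse=True)
    for i in order:                      # largest files first
        for _ in range(k):               # try each active node once
            if free[j] >= file_sizes[i]:
                free[j] -= file_sizes[i]
                placement[i] = j         # file i goes on active node j
                break
            j = (j + 1) % k
        j = (j + 1) % k                  # round-robin to the next node
    return placement
```

With three files of sizes 30, 20 and 10 on four nodes of capacity 50 and p = 0.5, two nodes are activated and the files are spread round-robin over them; the remaining two nodes can sleep.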

IV. EXPERIMENTS
A. Experimental Environment

1) Strategies Parameters: First, we highlight the differences between three kinds of policies based on our optimization algorithm. These policies span our objective spectrum along the Pareto front, from only considering energy consumption (E2ARSEC) to only considering expenditure (E2ARSEX). We also tried a balanced policy between energy consumption and expenditure, which is the one considered in the following experiments (E2ARS). To choose this policy, we ordered all the proposed individuals by energy consumption and chose the middle one. In the following experiments, t is set to 30 days and nbRds to 100.
We then compare our strategy with other data replication strategies proposed in the literature, and with a control strategy used as a baseline. As control replication strategy, we used a strategy that places 3 replicas randomly, called 3Rand. From the literature, we compare our strategy with a static replication strategy called MORM [5], which runs before the experiment starts and uses a multi-objective algorithm that considers availability, latency, load balancing, service time and energy consumption. We also compare our strategy with two dynamic strategies: (i) PEPR [1] takes into account the provider's profit (with an income of 0.0205$ per cloudlet) and performance; it is triggered by each violation and replicates if it is still profitable. (ii) Boru et al. [6] consider a hierarchical topology with 3 levels of databases (central, data center, rack); replication is triggered when the number of reads reaches a threshold, and occurs on the lower-level databases (data center, rack) if they consume less energy and bandwidth than the higher-level ones (central, data center).
2) Simulation Parameters: In order to compare those strategies, we implemented them on CloudSim [18]. This simulator has been extended by [1] (replication, monetary cost) and by [19] (energy consumption). We added the capability for nodes to be put into sleep mode, based on [20], in order to reduce the energy consumption. We supposed that storing data prevents a node from sleeping, to keep access to the stored data.
In the experiments, 3Rand, E2ARS, MORM and PEPR are considered using an architecture following a peer-to-peer topology, while Boru et al.'s architecture is based on a three-tier fat-tree topology.
We consider two kinds of experiments: small scale, with 30 files and 32 nodes per data center (experiments 1 and 2), and large scale, with 1024 files and 128 nodes per data center (experiments 3 and 4). The smaller-scale experiments are kept because MORM needed too much memory and could not be processed at large scale. For the large-scale experiments, we increased the number of cloudlets from 75,000 to 150,000.
For the workload, [21] shows that when a social media post links to Wikipedia, there is an increased interest in the topic, increasing the number of views of the linked Wikipedia page; the interest then fades away, decreasing the number of accesses to the page. [22] also highlights that this interest can fade at different speeds. Based on this information, the workload models a short interest for a content, implying a sharp increase followed by a slower decrease. To do so, we choose the gamma probability distribution to model the arrival rate of requests (with parameters α = 4 and β = 600). We then simulate two different workloads: the first one (experiments 2 and 4), where the increase starts at 1h30, and the second one (experiments 1 and 3). Simulation parameters are given in Table II and differences between experiments are summarized in Table III. Each experiment is run 25 times and results are reported as means and standard deviations. To compare these policies and strategies, we used 4 metrics: (i) the energy consumed (in MJ) by the Cloud at the end of the experiment, (ii) the total cost for the provider (in $), (iii) the number of created replicas, (iv) the proportion of violations.
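A workload of this shape can be sketched with the standard library alone. We assume here that β = 600 is the scale parameter of the gamma distribution (the paper does not state whether β is a scale or a rate):

```python
# Sketch of the workload generator: gamma-distributed samples with the
# paper's parameters (alpha = 4, beta = 600). The gamma density rises
# sharply and then decays slowly, matching the modeled burst of
# interest. beta-as-scale is our assumption.
import random

def interest_samples(n, alpha=4.0, beta=600.0, seed=42):
    """Draw n gamma-distributed samples (in seconds)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, beta) for _ in range(n)]

samples = interest_samples(1000)
print(len(samples), all(x > 0 for x in samples))  # 1000 True
```

With β as scale, the theoretical mean is α × β = 2400 seconds, i.e., interest peaks early and fades over roughly the first hour.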
A GitHub repository is available¹ to replicate the work and explore different experiments and parameters.

B. Experimental Results

1) Comparison between different policies:
First, we compare the different policies of our static data replication strategy: E2ARSEX, E2ARSEC and E2ARS.
First, it can be highlighted that the more a policy considers energy consumption, the fewer replicas it creates. Table IV shows that E2ARSEC has the lowest number of replicas in all experiments, followed by E2ARS and E2ARSEX. These differences have an impact on the number of violations: Table V shows that, in line with the number of replicas, E2ARSEX has the lowest number of violations, followed by E2ARS and E2ARSEC, in all experiments.
Tables VI and VII show results related to our main objectives, i.e., energy consumption and expenditure. They highlight the differences between E2ARSEC and E2ARSEX on both objectives: E2ARSEC has a lower energy consumption than E2ARSEX, but it is more expensive, in all experiments. Between these policies, E2ARS consumes a bit more energy and costs a bit less than E2ARSEC.
However, it should be noted that the number of files and nodes has an impact on these results, as it changes the range of possible placements. Indeed, some results show that E2ARS can be closer to E2ARSEX or E2ARSEC in terms of performance, energy consumption and expenditure.
2) Comparison between different strategies: Tables VI and VII show the results of each strategy for each experiment in terms of energy consumption and expenditure, respectively. As 3Rand is the control strategy, the other strategies are compared to it to show their advantages and drawbacks. Boru et al.'s strategy is the one that costs and consumes the most. MORM, in the small-scale experiments, is the second most energy-consuming (2 times the energy consumption of 3Rand), but it is the cheapest one (a reduction of 75% compared to 3Rand): as it creates replicas in all data centers, the cost to transfer files is greatly reduced. We also note (from experiment 3) that PEPR failed to create replicas, impacting its energy consumption and expenditure. E2ARS and PEPR are very close both in terms of energy consumption (a reduction of 24% for both compared to 3Rand) and expenditure (reductions of 16% and 25% compared to 3Rand, respectively), even though PEPR is a dynamic data replication strategy.

V. CONCLUSION
In this paper, we proposed a static data replication strategy called E2ARS that aims to reduce both energy consumption and expenditure compared to other strategies. The proposed strategy takes into account the trade-off between reducing energy consumption and reducing expenditure, which is decided by the administrator. E2ARS uses a two-step decision process in order to reduce the search space: the first step uses SPEA2 to take the cloud heterogeneity into account, and the second step is a heuristic that places data on few nodes to leverage technologies like PowerSleep.
We compared our strategy with a control strategy that places 3 replicas randomly (3Rand), as well as with existing strategies from the literature, both static (MORM) and dynamic (Boru et al., PEPR). Results show that E2ARS reduces the number of violations compared to 3Rand with only a few more replicas, and fulfills its objective of reducing both energy consumption and expenditure compared to the other strategies. Although E2ARS is a static strategy, its expenditure and number of violations are only slightly higher than those of the dynamic strategies.
For future work, we will use the proposed static data replication strategy as an initial placement for a dynamic data replication strategy that will use more 'intelligent' techniques. We are also currently repeating these experiments on a real architecture.