Enabling Decision Support Through Ranking and Summarization of Association Rules for TOTAL Customers

Our focus in this experimental analysis paper is to investigate existing measures that are available to rank association rules and to understand how they can be augmented further to enable real-world decision support as well as to provide customers with personalized recommendations. For example, by analyzing receipts of TOTAL customers, one can find that customers who buy windshield wash also buy engine oil and energy drinks, or that middle-aged customers from the South of France subscribe to a car wash program. Such actionable insights can immediately guide business decision making, e.g., for product promotion, product recommendation or targeted advertising. We present an analysis of 30 million unique sales receipts, spanning 35 million records, by almost 1 million customers, generated at 3,463 gas stations, over three years. Our finding is that the 35 commonly used measures to rank association rules, such as Confidence and Piatetsky-Shapiro, can be summarized into 5 synthesized clusters based on similarity in their rankings. We then use one representative measure in each cluster to run a user study with a data scientist and a product manager at TOTAL. Our analysis draws actionable insights to enable decision support for TOTAL decision makers: rules that favor Confidence are best to determine which products to recommend and rules that favor Recall are well-suited to find customer segments to target. Finally, we present how association rules using the representative measures can be used to provide customers with personalized product recommendations.


Introduction
Association rule mining [1] is one of the most frequently used techniques to analyze customers' shopping behavior and derive actionable insights to enable decision support. Like many others in the retail industry, marketers and product managers at TOTAL conduct regular studies of customer preferences and purchasing habits. The goal of those studies is to inform two main decisions: which products to bundle together in a promotional offer and which customers to target. Those studies usually focus on unveiling the interest of customers for specific products or categories (e.g., tire service, gas, food) or the behavior of pre-defined customer segments. However, when the underlying dataset is extremely large, such as the one we use for our analysis from TOTAL (30 million receipts spanning 35 million records), mining can produce an explosion of association rules; one therefore has to rely on existing ranking measures, such as Support, Confidence, Piatetsky-Shapiro and Lift, to rank the rules. Even then, because there exist many ranking measures (as many as 35) [4,12], there is little guidance on which ranking measure to use for which type of decision-making task, unless these measures are further summarized. Our work is funded by a grant from TOTAL.
To address that, we leverage the power of association rule mining and ranking measures for marketers to extract actionable insights from large volumes of consumer data. To keep the outcome tightly aligned with practitioners' needs, our workflow consists of the following 4 steps:
Step 1: We empower non-scientist domain experts with the ability to express and analyze association rules of interest.
Step 2: We summarize the ranking measures into a set of synthesized clusters or groups. The outcome of this process is a set of 5 synthesized clusters (or groups) that summarize the ranking measures effectively.
Step 3: We allow the non-scientist domain experts to provide feedback on the synthesized clusters.
Step 4: We show how this process can provide actionable insights and enable decision support for virtually any customer segment and any product. Our analysis shows that rules that favor Confidence are best to determine which products to promote and rules that favor Recall are well-suited to find customer segments to target.
To the best of our knowledge, this work is the first to run a large-scale empirical evaluation of insights on customer purchasing habits in the oil and gas retail domain. We summarize our contributions as follows: (i) a reproducible methodology for experimenting with different association rule ranking methods; (ii) several insights on real large-scale datasets; (iii) a demonstration of how to use association rules and interestingness measures in computing recommendations.

Empowering domain experts
When analysts seek to determine which products to run a promotion for or which customers to target, they conduct small to medium-scale market analysis studies. Such studies are expensive, time-consuming and hardly reproducible. We use association rule mining to unveil valuable information about any customer segment and any product. Our collaboration with analysts at TOTAL resulted in the formalization of two kinds of purchasing patterns: those representing associations between a set of products and a single product (customers who wash their cars and purchase wipes also purchase a windshield washer), and those associating customer segments to a product category (young customers in the south of France who frequently wash their cars).
Our dataset contains 30 million unique receipts, spanning 35 million records, generated by 1 million customers at 3,463 gas stations over three years (from January 2017 to December 2019). The ratio 30/35 is due to the fact that, unlike in regular retail such as grocery stores [12], most customers at gas stations purchase gas only, and a few purchase additional products such as car wash, drinks and food items.
Based on our initial discussion with TOTAL analysts, we propose two mining scenarios to capture desired purchasing patterns. The goal is to help analysts who are not necessarily tech-savvy express their needs. In the first scenario, prod assoc, the analyst specifies a target product and expects rules of the form set of products → target product. In the second scenario, demo assoc, the analyst specifies a target product category and expects rules of the form customer segment → category, i.e., customers who purchase products in that category. Each scenario requires ingesting and preparing the data as a set of transactions. The transactions are fed to jLCM [16], our open-source parallel and distributed pattern mining algorithm that runs on MapReduce [13], to compute association rules. To cope with the skewed distribution of our transactions, jLCM is parameterizable and is used to mine per-item top-k itemsets.

Ranking and summarization of association rules
Regardless of the mining scenario, the number of resulting rules can quickly become overwhelming. As an example, for a single target product, TOTAL wash, and with a minimum support of 1,000, jLCM mines 4,243 frequent rules of the form set of products → TOTAL wash. Out of these, 805 have a Confidence of 50% or higher. Table 1 shows a ranking of the top-5 rules for the product category Lubricants and the top-5 rules for the product Coca Cola, sorted using 2 different interestingness measures proposed in the literature [4]. Given the rule A → B, Confidence is akin to precision and is defined as the probability to observe B given that we observed A, i.e., P(B|A). Piatetsky-Shapiro [22] compares how often A and B occur together with how often they would if they were independent, i.e., P(AB) − P(A)P(B). Recall is defined as the probability to observe A given that we observed B, i.e., P(A|B). Clearly, different measures yield different rule rankings for both prod assoc and demo assoc.
To ease the burden on analysts, we propose to examine the rankings induced by existing measures (exactly 35 measures [4]) and attempt to reduce them based on similarities in rankings. We run our measures to rank association rules for 228 representative products in prod assoc and for 16 representative product categories in demo assoc. In each case, we use hierarchical clustering to summarize or group the rule rankings based on their similarities (we use multiple list similarities to compare rankings). Our finding is that existing measures can be clustered into 5 similar synthesized groups regardless of the mining scenario. The clusters we obtained are summarized in Table 3. They differ in their emphasis on

Gathering feedback
The reduction of the number of interestingness measures to rank rules enabled us to conduct a user study with 2 analysts, one data scientist and one product manager (co-authors of this paper), at TOTAL to address the following question: out of the 5 groups of similar interestingness measures, which ones return actionable rules? Actionable rules are ones that can be used by analysts either to promote products or to find customer segments to target. Our study lets analysts compare 2 (hidden) ranking measures at a time for a given scenario and a given target product or category. Our first deployment was deemed "reassuring" and "unsurprising". A joint examination of the results identified two issues: (1) rules contained many "expected associations", i.e., those resulting from promotional offers that already occurred; (2) many rules were featuring "familiar" items, i.e., frequently purchased ones. After filtering unwanted items such as gas, plastic bags, etc and offers, we ran a second deployment with our analysts. Their interactions with returned association rules (of the form A → B, where B is a product or a category) were observed and their feedback recorded. This deployment yielded two insights: rankings that favor Confidence, i.e., P (B|A), are best to determine which products to promote while rankings that favor Recall, i.e., P (A|B), are well-suited to find which customer segments to target. Confidence represents how often the consequent is present when the antecedent is, that is, P (B|A), and confidence-based ranking can be used to determine which A products to bundle with a target product B to promote B. Recall represents the proportion of target items that can be retrieved by a rule, that is, P (A|B), and recall-based ranking can be used to determine which customer segments A to target with B.

Product recommendation
Finally, we show how association rules can effectively be used to perform product recommendation using different interestingness measures. Clustering the overwhelming number of interestingness measures into 5 synthesized clusters enabled us to conduct an offline experiment to test the effectiveness of each cluster of measures at generating accurate product recommendations. We split our data using the available timestamps into a training set (transactions from January 2017 to December 2018) and a test set (transactions from January 2019 to December 2019), i.e., we extract association rules based on past purchases to predict future purchases. The obtained accuracy results are consistent with our clustering as well as with the preference of our analysts for measures that favor Confidence for product recommendation.
In summary, this paper presents a joint effort between researchers in academia and analysts at TOTAL. We leverage the power of association rule mining and augment it with rule ranking and summarization to guide decision support and to perform product recommendation.
The rest of the paper is organized as follows. The background and the goal of the work are provided in Section 2. Our underlying process using TOTAL datasets is described in Section 3. In Section 4, we describe how we summarize (cluster) interestingness measures based on similarities in rule rankings. These clusters are then evaluated by analysts in Section 5, leading to insightful findings. We discuss how to turn our findings into product recommendations through association rule ranking in Section 6. The related work is summarized in Section 7. We conclude in Section 8.

Background and Overall Goal
We describe the TOTAL dataset, the mining scenarios, and interestingness measures used to rank association rules, and finally we state our goal.

Dataset
Our dataset represents customers purchasing products at different gas stations that are geographically distributed in France, for a period of two years (from January 2017 to December 2018). The dataset D is a set of records of the form ⟨t, c, p⟩, where t is a unique receipt identifier, c is a customer, and p is a product purchased by c. The set of all receipt identifiers is denoted T. Each receipt identifier is associated with a unique customer, and multiple receipt identifiers can be associated with the same customer according to his/her visits to different gas stations. When a customer purchases multiple products in the same visit to a gas station, several records with the same receipt identifier t are generated. The complete dataset contains over 30 million unique receipts, spanning 35 million records, generated at 3,463 gas stations, over three years. The ratio 30/35 in our dataset is due to the fact that, unlike in regular retail such as grocery stores [12], most customers at gas stations purchase gas only, and a few of them purchase additional products such as car services (oil change, car wash), drinks and food items.
The set of customers, C, contains over 1 million unique records. Each customer has demographic attributes. In this study, we focus on 3 attributes: age, gender and location. The attribute age takes values in {< 35, 35−49, 50−65, > 65} and the attribute location admits French regions as values. We use demographics(c) to refer to the set of attribute values a customer c belongs to. For example, {< 35, F, Ile-de-France} represents a 28-year-old female from the Ile-de-France region, whom we will refer to as Mary. The attributes are used to form customer segments. Each segment is described by a set of user attribute values that are interpreted in the usual conjunctive manner. For example, the segment {< 35, *, Ile-de-France} refers to young customers from the Ile-de-France region and the segment {> 65, M, Normandie} refers to senior male customers from the Normandie region.
The set of products P contains over 37,556 entries, out of which 976 have been sold more than a thousand times. Each product p is associated with a product category. Our dataset contains 54 different categories including gas, lubricants, car wash, hot drinks, and sweets. We use cat(p) to denote the category of a product p.

Mining Customer Receipts
We describe our data preparation process, that is, how to translate the sales receipts into a transactional dataset that can be fed to the mining process. We then describe the mining scenarios and present interestingness measures to rank association rules. Figures 1, 2 and 3 report statistics on one month in the dataset, which contains 407,212 sales records generated by 257,102 customers for 5,479 products at 3,079 gas stations. For confidentiality reasons, we do not report the statistics of the full dataset. We can however state that other periods in the dataset exhibit similar distributions. The statistics clearly show that the most purchased items are gas and that most transactions are short.

Table 2. Our mining scenarios and example association rules.
Scenario: demo assoc
  Transactions T: {demo(c) ∪ cat(p) | ⟨t, c, p⟩ ∈ D}
  Rule form: segment → category
  Minimum support: 1,000
  Desired association rules: a segment of customers who are likely to purchase products in a given category
  Example: {< 35, F, *} → car wash
Scenario: prod assoc
  Transactions T: {∪ p_i over ⟨t_j, c, p_i⟩ ∈ D | c ∈ C}
  Rule form: product(s) → product
  Minimum support: 1,000
  Desired association rules: customers who purchase a set of products and are likely to purchase the target product
  Example: {Bbq Chips, Snickers Bar} → Coca Cola
To gain an understanding of customers' buying habits and provide them with relevant offers, analysts from TOTAL are interested in studying two kinds of purchasing patterns: those associating a set of products to a single product (customers who wash their cars and purchase wipes also purchase a windshield washer) and those representing associations between customer segments and a product category (young customers in the south of France who frequently wash their cars). In all cases the analyst specifies a rule target B which corresponds to a product or a product category, and expects rules of the form A → B.
In the first scenario, that we denote prod assoc, the analyst specifies a target product and is shown rules of the form set of products → target product, i.e. customers who purchase the set of products are likely to purchase the target product. In the second scenario, that we denote demo assoc, the analyst specifies a target category, and is shown rules of the form customer segment → target category, i.e. customers who belong to some segment are likely to purchase products in the target category.
In both scenarios, the original dataset D is mapped into a collection of transactions T that is given as input to the mining process, as summarized in Table 2. The set T is built differently according to each scenario.
In the first scenario, prod assoc, we generate the set of transactions T by grouping records in D by customer identifier. For each customer c, we generate a single transaction containing the set of all products ever purchased by c, {p | ⟨t, c, p⟩ ∈ D}. We obtain |C| transactions, each of which is a subset of P. This enables the discovery of customer patterns occurring over several visits to a station. The number of transactions in prod assoc is 1,083,901, where each transaction contains 7 products on average.
In the second scenario, demo assoc, a transaction is a tuple built for each record ⟨t, c, p⟩ by associating the customer segment demographics(c) with the corresponding category of the product, cat(p). For example, an entry in the raw data consisting of the record ⟨4523768, Mary, tea⟩ is mapped to the transaction ⟨< 35, F, Ile-de-France, hot drinks⟩. We obtain |D| transactions, and each transaction contains the segment a customer belongs to and the category of the purchased product.
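The two transaction constructions above can be sketched in a few lines of plain Python. This is only an illustration on toy data: the records, the demographics and category lookups, and the function names are all hypothetical, and the actual pipeline described later relies on Spark rather than in-memory dictionaries.

```python
from collections import defaultdict

# Hypothetical toy data; names and values are illustrative only.
records = [  # (receipt id t, customer c, product p)
    (1, "mary", "tea"),
    (1, "mary", "croissant"),
    (2, "mary", "car wash"),
    (3, "paul", "tea"),
]
demographics = {"mary": ("<35", "F", "Ile-de-France"),
                "paul": (">65", "M", "Normandie")}
category = {"tea": "hot drinks", "croissant": "food", "car wash": "car wash"}


def prod_assoc_transactions(records):
    """One transaction per customer: every product the customer ever bought."""
    by_customer = defaultdict(set)
    for _receipt, customer, product in records:
        by_customer[customer].add(product)
    return dict(by_customer)


def demo_assoc_transactions(records, demographics, category):
    """One transaction per record: customer segment plus product category."""
    return [demographics[c] + (category[p],) for _t, c, p in records]


prod_T = prod_assoc_transactions(records)   # |C| transactions
demo_T = demo_assoc_transactions(records, demographics, category)  # |D| transactions
```

Note that prod assoc yields one transaction per customer while demo assoc yields one per record, which matches the |C| and |D| counts stated above.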
Mining Scenarios. Searching for regularities in a dataset plays an essential role in data mining tasks that retrieve interesting patterns. Frequent itemset mining is the task of identifying sets of items which often occur together in the dataset. Given a frequency threshold ε ∈ [1, n], an itemset P is said to be frequent in a transaction set T iff support_T(P) ≥ ε, where support_T(P) is the number of transactions in T that contain all items of P simultaneously. As indicated in Table 2, we set the frequency threshold to 1,000 in both scenarios. Because marketing actions are decided and applied nation-wide, they are expected to concern at least 1,000 customers.
An itemset P is closed if and only if there exists no itemset P′ ⊃ P such that support_T(P′) = support_T(P) [20]. The number of closed itemsets can be orders of magnitude smaller than the number of itemsets, while providing the same amount of information on T. Several algorithms, including ours, focus on extracting frequent closed itemsets, increasing performance and avoiding redundancy in results [21,30]. We consider our 2 mining scenarios described in Section 2.2. Each scenario leads to the construction of a different collection of transactions T, where a transaction is a set of items. Given T and a frequency threshold ε, we retrieve all closed frequent itemsets, and use them to derive association rules [28]. Each itemset P implies an association rule of the form A → B where A, B is a partition of P. A is the antecedent of the rule, and B its consequent. In prod assoc, A is a set of products (A ⊆ P) and B is a product. In demo assoc, A is a customer segment and B is a product category. Analysts generally focus on particular products or product categories. That is why they specify the targets that they are interested in for each scenario. Table 2 contains example association rules extracted from our dataset.

Table 3. Interestingness measures of a rule A → B. Symbols such as ♦, † and ⊗ indicate measures that always produce the same rule ranking. |T| is the number of transactions.
[Table body not recovered. Surviving entries: Kappa, Piatetsky-Shapiro, Two-Way Support, Specificity, Recall, Collective Strength. Surviving group descriptions: "average confidence, high recall" and "lowest confidence, highest recall" (G5).]
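The definitions of support, closedness and rule derivation given above can be made concrete with a brute-force sketch. It is not the jLCM algorithm: it enumerates every candidate itemset and is only usable on tiny inputs. The function names and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent_itemsets(transactions, min_support):
    """Brute-force enumeration; only suitable for tiny examples."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                frequent[frozenset(combo)] = s
    # An itemset is closed if no strict superset has the same support.
    return {p: s for p, s in frequent.items()
            if not any(p < q and s == sq for q, sq in frequent.items())}


def rules_for_target(closed, target):
    """Split each closed itemset containing `target` into a rule A -> target."""
    return [(p - {target}, target) for p in closed
            if target in p and len(p) > 1]


T = [frozenset(t) for t in (
    {"chips", "snickers", "cola"},
    {"chips", "cola"},
    {"chips", "snickers", "cola"},
    {"water"},
)]
closed = closed_frequent_itemsets(T, min_support=2)
rules = rules_for_target(closed, "cola")
```

On this toy input only {chips, cola} and {chips, snickers, cola} are closed, so the seven frequent itemsets collapse to two without losing any support information.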

Interestingness Measures
The ability to identify valuable rules is of utmost importance to avoid drowning analysts in useless information. Association rules A → B were originally selected using thresholds for Support (support_T(A ∪ B)) and Confidence (support_T(A ∪ B) / support_T(A)) [1]. However, using two separate values, and guessing the right thresholds, is not natural. Furthermore, support and confidence do not always coincide with the interest of analysts. Hence, a number of interestingness measures that serve different analyses were proposed in the literature [4,19]. Table 3 summarizes the measures we use in this work. The first column contains the name of the measure, the second its expression. Table 4 gives the group and a description of each measure; we refer to it later in the paper.

Goal
Our goal is to help analysts test and compare the rankings produced by different interestingness measures on rules extracted from D. An analyst can specify one of 2 mining scenarios, prod assoc and demo assoc, and one or several targets (products in the case of prod assoc, categories in the case of demo assoc), and the system generates as many rule rankings as the number of interestingness measures.

Acquisition and storage
Each of the 3,463 gas stations maintains a log of all customer transactions completed during one day. Whenever a customer authenticates her purchases using her loyalty card, a receipt containing the list of purchased products, their price, their category, as well as potential promotional offers, is generated. For each purchased product, a record containing the receipt id, product id and customer id is generated. These records are logged as ⟨r, c, p⟩ triples and stored in a write-ahead log. Once a day, at closing time of each gas station, this log is transferred to the main data store. We have access to an SQL database containing the sales table where sales records are stored. Each customer is an entry in the customers table, which records the information she provided for her loyalty card (age, gender, region). Note that we do not have access to confidential information such as name and phone number.

Data curation and preparation
We first query the sales table to retrieve the full raw sales records. We also query the customers table to retrieve for each customer the corresponding segment attributes. At the end of this step, we generate two text files. Each line in the sales file is a triple ⟨r, c, p⟩, and each line in the customers file is a quadruple ⟨c, age, gender, region⟩. As described in Section 2.2, mining customer receipts starts with the construction of a transactions dataset T according to the mining scenario specified by the analyst. We rely on Apache Spark and MapReduce operations to build the dataset T for each mining scenario. The sales file is loaded as a resilient distributed dataset. We maintain a HashMap that associates each customer with her segment, and another HashMap that associates each product with its category. In the case of prod assoc, the products bought by a given customer are grouped by customer identifier using a groupByKey operation. In the case of demo assoc, a single map operation is sufficient. For each row ⟨r, c, p⟩ in the dataset, the map operation constructs a transaction ⟨age, gender, region, cat(p)⟩.
In both cases, a dataset T is created as a text file, with one line per transaction. In prod assoc, an example of a line is "gas, car wash, cafe, sandwich", which represents all products ever purchased by a single customer. In demo assoc, an example of a line is "> 65, M, Ile-de-France, Soft drinks". Given a dataset T, we can now perform the mining process.

Mining
Extracting itemsets using jLCM. Generating association rules, presented in Section 2.2, requires first extracting frequent itemsets from T. We use jLCM [16], our open-source parallel and distributed pattern mining algorithm that runs on MapReduce [13]. Mining frequent itemsets is done in two steps. We scan the input dataset T once and build a filtered dataset limited to transactions containing the target B specified by the analyst: T_B = {E ∈ T | B ∈ E}. Then, we execute jLCM on the filtered dataset. jLCM is a recursive algorithm that retrieves frequent itemsets and computes their frequency. Closed itemsets are returned along with their corresponding support, except for singletons, which cannot be used to produce association rules. This extraction allows us to quickly obtain itemsets that satisfy our constraint, i.e., all extracted itemsets contain the specified target B.
Mining rules. Our analysts aim at uncovering interesting association rules expressed as A → B where B is the specified target. Evaluating the interestingness of an association rule requires computing the support of itemsets A, B and A ∪ B in T. The standard method for mining association rules consists in finding all frequent itemsets in the dataset, and then generating the rules. Given that our analyst specifies a single target B at a time, this approach would be wasteful. This motivates using jLCM on the filtered dataset limited to transactions containing the target B. The result of the itemset extraction using jLCM contains the support of B and A ∪ B for all association rules we are interested in. At this point, we need to calculate the support of each antecedent itemset A. Thus, in a post-processing step, we scan the dataset T once and compute the support of all antecedents A. This two-step approach avoids the computation of many itemsets that will never appear as a rule antecedent.
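The two-step procedure can be sketched as follows. The sketch is a stand-in under explicit assumptions: it replaces jLCM with a naive enumeration of every antecedent co-occurring with the target (so it returns all frequent itemsets, not only closed ones), and the function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_rules_for_target(transactions, target, min_support):
    """Two-step sketch: mine inside T_B, then count antecedents over T."""
    # Step 1: keep only the transactions that contain the target B.
    t_b = [t for t in transactions if target in t]
    support_b = len(t_b)

    # Naive stand-in for the miner: count every antecedent A seen with B.
    # support(A ∪ B) in T equals the number of transactions of T_B containing A.
    support_ab = Counter()
    for t in t_b:
        rest = sorted(t - {target})
        for size in range(1, len(rest) + 1):
            for antecedent in combinations(rest, size):
                support_ab[frozenset(antecedent)] += 1
    support_ab = {a: s for a, s in support_ab.items() if s >= min_support}

    # Step 2: a single pass over the full dataset T to obtain support(A).
    support_a = Counter()
    for t in transactions:
        for a in support_ab:
            if a <= t:
                support_a[a] += 1

    return {a: {"A": support_a[a], "B": support_b, "AB": support_ab[a]}
            for a in support_ab}


T = [frozenset(t) for t in (
    {"coffee", "water"},
    {"coffee", "water", "gum"},
    {"coffee"},
    {"water"},
    {"gum"},
)]
counts = mine_rules_for_target(T, target="water", min_support=2)
```

Only antecedents that survive the first step are counted in the second, which is the saving the text describes: itemsets that never co-occur with B are never materialized.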
Evaluating relevant rules. To evaluate the interestingness of an association rule A → B, we only need to compute P(A), P(B) and P(A ∪ B) because, given the number of all transactions |T|, other probabilities such as P(B|A) and P(A¬B) can be derived. Therefore, we denormalize the results of the mining phase to store those three probabilities with each A and B. The support of all rules' antecedents (used to compute P(A)) is added to the results of the mining phase (used to compute P(B) and P(A ∪ B)). We create a dataframe where each row represents an association rule and has enough information to compute its interestingness. For instance, in the case of prod assoc, for a rule like Coffee → Water, the system computes 3 values: Support (number of customers who purchased both Coffee and Water), Confidence (fraction of Coffee buyers who also bought Water) and Recall (fraction of Water buyers who also bought Coffee). This dataframe is augmented with 35 columns, one for each implemented measure listed in Table 3.
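As a small worked illustration of how a few measures follow from the three stored quantities, the sketch below derives Confidence, Recall, Lift and Piatetsky-Shapiro from raw counts. The function name and the counts are hypothetical; Lift is taken in its usual form P(AB) / (P(A)P(B)), and P(AB) here denotes the probability that a transaction contains all items of A and B, written P(A ∪ B) in the itemset notation above.

```python
def rule_measures(support_a, support_b, support_ab, n_transactions):
    """Derive a few interestingness measures of A -> B from raw counts."""
    p_a = support_a / n_transactions
    p_b = support_b / n_transactions
    p_ab = support_ab / n_transactions
    return {
        "support": support_ab,
        "confidence": p_ab / p_a,              # P(B|A)
        "recall": p_ab / p_b,                  # P(A|B)
        "lift": p_ab / (p_a * p_b),
        "piatetsky_shapiro": p_ab - p_a * p_b,
    }


# Hypothetical counts: 200 transactions, 40 contain A, 50 contain B, 20 contain both.
m = rule_measures(support_a=40, support_b=50, support_ab=20, n_transactions=200)
```

With these counts, half of the transactions containing A also contain B (Confidence 0.5), while the rule retrieves 40% of the transactions containing B (Recall 0.4).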

Ranking and summarization
Our goal, stated in Section 2.3, is to assist analysts in selecting the most actionable rules, those that can be used to promote products or target specific customers. In this section, we present an empirical evaluation of the 35 measures for association rules introduced in Section 2.2. The main goal of our evaluation is to compare the rankings of association rules produced by those measures on our dataset, and study their similarities. This lets us summarize ranking measures into similar clusters. We explain obtained clusters in Sections 4.2 and 4.3 and discuss their differences. This empirical evaluation automatically reduces the number of candidate measures to present to analysts in the user study.

Ranking similarity measures
We rely on the methods used in [12] to compare ranked lists of rules produced by different interestingness measures. The first three methods are taken from the literature. The last one NDCC is a parameter-free measure defined in [12] to emphasize differences at the top of the rankings.
We are given a set of association rules R to rank. Each measure m is seen as a function that receives a rule and generates a score, m : R → ℝ. We use L_R^m to denote an ordered list composed of the rules in R, sorted by decreasing score. Thus, L_R^m = ⟨r_1, r_2, ...⟩ s.t. ∀ i′ > i, m(r_i′) < m(r_i). We generate multiple lists, one for each measure m, from the same set R. The rank of a rule r in L_R^m is given by rank(r, L_R^m) = |{r′ | r′ ∈ R, m(r′) ≥ m(r)}|. To assess dissimilarity between two measures, m and m′, we compute the dissimilarity between their ranked lists, L_R^m and L_R^{m′}. We use r^m as a shorthand notation for rank(r, L_R^m).
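The rank definition above translates directly into code. In this minimal sketch the rule names and scores are made up; it also shows a consequence of the definition, namely that tied rules all receive the largest rank of their tie group.

```python
def ranks(scores):
    """rank(r) = number of rules whose score is >= the score of r."""
    return {r: sum(1 for other in scores.values() if other >= s)
            for r, s in scores.items()}


# Hypothetical scores assigned by one measure to three rules.
confidence = {"r1": 0.9, "r2": 0.5, "r3": 0.7}
rank_by_confidence = ranks(confidence)

# With a tie, both tied rules are counted against each other.
tied = ranks({"r1": 0.9, "r2": 0.9, "r3": 0.1})
```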
Spearman's rank correlation coefficient. Given two ranked lists L_R^m and L_R^{m′}, Spearman's rank correlation [3] computes a linear correlation coefficient that varies between 1 (identical lists) and −1 (opposite rankings).
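For reference, the textbook form of Spearman's coefficient can be written with the shorthand r^m for rank(r, L_R^m). This is the standard definition, valid when the rankings contain no ties; it is assumed, not confirmed, to be the exact form used in this work.

```latex
\rho\left(L_R^{m}, L_R^{m'}\right)
  = 1 - \frac{6 \sum_{r \in R} \left(r^{m} - r^{m'}\right)^{2}}
             {|R|\left(|R|^{2} - 1\right)}
```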
This coefficient depends only on the difference in ranks of the element (rule) in the two lists, and not on the ranks themselves. Hence, the penalization is the same for differences occurring at the beginning or at the end of the lists.
Kendall's τ rank correlation coefficient. Kendall's τ rank correlation coefficient [10] is based on the idea of agreement among element (rule) pairs. A rule pair is said to be concordant if their order is the same in L_R^m and L_R^{m′}, and discordant otherwise. τ computes the difference between the number of concordant and discordant pairs and divides it by the total number of pairs.
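Written out, the verbal definition above corresponds to the following standard expression, where C and D are symbols introduced here for the numbers of concordant and discordant rule pairs, and the denominator is the total number of pairs among the |R| rules. Variants that correct for ties exist and are not covered by this form.

```latex
\tau\left(L_R^{m}, L_R^{m'}\right)
  = \frac{C - D}{\tfrac{1}{2}\,|R|\left(|R| - 1\right)}
```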
Similar to Spearman's, τ varies between 1 and −1, and penalizes uniformly across all positions.
Overlap@k. Overlap@k is another method for comparing ranked lists that is widely used in Information Retrieval. It is based on the premise that in long ranked lists, the analyst is only expected to look at the top few results that are highly ranked. While Spearman and τ account for all elements uniformly, Overlap@k compares two rankings by computing the overlap between their top-k elements only.
Normalized Discounted Correlation Coefficient. Overlap@k on the one hand, and Spearman's and τ on the other, sit at two different extremes. The former is conservative in that it takes into consideration only the top k elements of the list, and the latter two take too liberal an approach by penalizing all parts of the lists uniformly. In practice, we aim for a good tradeoff between these extremes.
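As a concrete reading of Overlap@k, the sketch below compares the top-k prefixes of two rankings. Dividing the overlap by k to obtain a value in [0, 1] is an assumption of this sketch, since the text does not state a normalization; the rule names are made up.

```python
def overlap_at_k(list_a, list_b, k):
    """Fraction of elements shared by the top-k prefixes of two rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a, top_b = set(list_a[:k]), set(list_b[:k])
    return len(top_a & top_b) / k


# Two hypothetical rankings of the same five rules.
by_confidence = ["r1", "r2", "r3", "r4", "r5"]
by_recall = ["r2", "r1", "r5", "r4", "r3"]

same_top_2 = overlap_at_k(by_confidence, by_recall, k=2)
partial_top_3 = overlap_at_k(by_confidence, by_recall, k=3)
```

The example illustrates why the measure is called conservative: the swap of r1 and r2 inside the top 2 is invisible to it, and everything below position k is ignored.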
To bridge this gap, we use NDCC (Normalized Discounted Correlation Coefficient), a ranking correlation measure proposed in [12]. NDCC draws inspiration from NDCG, Normalized Discounted Cumulative Gain [9], a ranking measure commonly used in Information Retrieval. The core idea in NDCG is to reward a ranked list L_R^m for placing an element r of relevance rel_r by rel_r / log(r^m). The logarithmic part acts as a smoothing discount rate representing the fact that as the rank increases, the analyst is less likely to observe r. In our setting, there is no ground truth to properly assess rel_r. Instead, we use the ranking assigned by m′ as a relevance measure for r, with an identical logarithmic discount. When summing over all of R, we obtain DCC, which presents the advantage of being a symmetric correlation measure between two rankings L_R^m and L_R^{m′}. We compute NDCC by normalizing DCC between 1 (identical rankings) and −1 (reversed rankings).
Rankings comparison by example. We illustrate similarities between all ranking correlation measures with an example in Table 5. This shows the correlation of a ranking L_1 with 3 others, according to each measure. NDCC does indeed penalize differences at higher ranks, and is more tolerant at lower ranks.
We perform a comparative analysis of the 35 interestingness measures applied to our two mining scenarios summarized in Table 2. We report the results of this comparison for prod assoc in Section 4.2 and for demo assoc in Section 4.3. Overall we identify 5 clusters of similar interestingness measures with some differences between the two scenarios. This confirms the need for a data-driven clustering of interestingness measures in each scenario.

Rankings comparison for prod assoc
For prod assoc, we generate a set of association rules A → B, where B is a single product among a set of 228 representative products that were selected by our analysts. For each product B, analysts seek to make one of two decisions: which products A to bundle B with in an offer, and who to target for product B (customers who purchase products in A). Overall we obtain 253,334 association rules. We compute one rule ranking per target product and per interestingness measure.
While all measures are computed differently, we notice that some of them always produce the same ranking of association rules. We identify them in Table 3 using special symbols. For example, it is easy to see that Information gain = $\log_2(\text{Lift})$. Information gain is a monotonically increasing transformation of Lift, so the two measures return exactly the same rankings. It is also easy to see that Loevinger = $1 - \frac{1}{\text{Conviction}}$. Thus, the higher the rank of any association rule r according to Conviction, the higher its rank according to Loevinger, which leads to exactly the same rule rankings for these two measures. In addition, some of the measures that always return the same rule rankings can be easily explained analytically. Since our analyst specifies a single target product at a time, P(B) is constant for a given ranking, which eliminates some of the differences between the considered interestingness measures. We discuss the existing relationships between all the studied measures in Section 4.5.
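The effect of a monotonically increasing transformation on rankings can be checked numerically. The sketch below assumes the usual textbook definitions of Lift and Conviction; the rule names, probabilities and helper functions are invented for illustration and do not come from the TOTAL dataset.

```python
import math

# Hypothetical rules for one fixed target B with P(B) = 0.2.
# Each tuple is (rule name, P(A), P(A and B)); the numbers are made up.
P_B = 0.2
rules = [("a1", 0.10, 0.05), ("a2", 0.30, 0.09), ("a3", 0.05, 0.04), ("a4", 0.40, 0.10)]


def confidence(p_a, p_ab):
    return p_ab / p_a


def lift(p_a, p_ab):
    return p_ab / (p_a * P_B)


def information_gain(p_a, p_ab):
    return math.log2(lift(p_a, p_ab))


def conviction(p_a, p_ab):
    # P(A) * P(not B) / P(A and not B)
    return p_a * (1 - P_B) / (p_a - p_ab)


def loevinger(p_a, p_ab):
    return 1 - 1 / conviction(p_a, p_ab)


def ranking(measure):
    """Rule names sorted from the highest to the lowest value of `measure`."""
    ordered = sorted(rules, key=lambda rule: -measure(rule[1], rule[2]))
    return [name for name, _, _ in ordered]


print(ranking(lift), ranking(information_gain))
print(ranking(conviction), ranking(loevinger))
```

Because P(B) is the same for every rule in this toy setting, Confidence orders the rules in the same way as Lift, which illustrates the fixed-target case mentioned above.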
Comparative analysis We now evaluate the correlation between interestingness measures that do not return the same rankings. We compute a correlation matrix of all rankings according to each correlation measure described in Section 4.1, and average them over the 228 target products that were chosen by analysts. This gives us a ranking correlation between all pairs of measures. The correlation matrix is then transformed into a distance matrix M, i.e., the higher the correlation, the smaller the distance. Given the distance matrix M, we can proceed to cluster interestingness measures. We choose to use hierarchical agglomerative clustering with average linkage [27]. One of the advantages of hierarchical clustering is that it produces a complete sequence of nested clusterings, by starting with each measure in its own cluster and successively merging the two closest clusters into a new cluster until a single cluster containing all of the measures is obtained. For our hierarchical clustering implementation, we rely on the cluster.hierarchy module of the SciPy package for Python. We obtain a dendrogram of interestingness measures and analyze their similarities. The dendrograms for NDCC and τ are presented in Figure 4. Figure 5 shows the complete dendrogram for all interestingness measures using hierarchical clustering.
Fig. 4. Summarization of interestingness measures through hierarchical clustering for prod assoc (clusters are described in Table 3)
To describe the results more easily, we partition the interestingness measures into 5 clusters, as indicated in the third column of Table 3. G1 is by far the largest cluster and contains 18 measures (among which Lift, Confidence, Added value) that produce very similar rankings; within it, 6 groups of measures always generate the same rankings. A second cluster G2 comprising 3 measures (Accuracy, Gini index, Least contradiction) is similar to G1 according to τ. This similarity between G1 and G2 is higher according to NDCC, which shows that it is mostly caused by high ranks. A third cluster G3 containing 7 measures (among which J-measure) emerges, as well as a fourth cluster G4 containing 5 measures (among which Piatetsky-Shapiro).
Interestingly, we observe from the dendrograms in Figure 4 that according to NDCC, G1 and G2 are very similar. The same is true for G3 and G4. This difference between ranking correlation measures illustrates the importance of accounting for rank positions. When the top of the ranked association rules is considered more important, some similarities between clusters emerge. We illustrate this behavior in Figure 6 by displaying the average rank difference between Confidence (G1) and both Accuracy (G2) and Gini (G2). This experiment clearly shows that when focusing on the top-20 rules (Overlap@20), the average rank difference between Confidence and both Accuracy (G2) and Gini (G2) is small. The same situation occurs between rankings obtained by G3 and G4. This explains the differences that emerge in clustering interestingness measures when using NDCC/Overlap and τ/Spearman.
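The clustering step can be sketched with SciPy as follows. The correlation values are made up, only four measures are shown, and the conversion distance = 1 − correlation is an assumption of this sketch: the text only requires that a higher correlation yields a smaller distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up averaged rank correlations in [-1, 1] between four measures.
measures = ["Confidence", "Lift", "Recall", "Jaccard"]
correlation = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])

# Assumed conversion: the higher the correlation, the smaller the distance.
distance = 1.0 - correlation

# linkage expects the condensed (upper-triangular) form of the distance matrix.
merges = linkage(squareform(distance, checks=False), method="average")

# Cut the dendrogram so that at most two clusters remain.
labels = fcluster(merges, t=2, criterion="maxclust")
clusters = dict(zip(measures, labels))
print(clusters)
```

The `merges` array can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree; cutting it at a different number of clusters only requires changing `t`.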
Fig. 6. Rank correlations
Explaining clusters While using hierarchical clustering on interestingness measures allows the discovery of clusters of similar measures, it does not fully explain which types of results are favored by each of them. We propose to compare the output clusters according to the two most basic and intuitive interestingness measures used in data mining: Confidence and Recall. Confidence represents how often the consequent is present when the antecedent is, that is, P(B|A). Its counterpart, Recall, represents the proportion of target items that can be retrieved by a rule, that is, P(A|B).
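As a concrete reading of these two definitions, the following sketch counts transactions to estimate P(B|A) and P(A|B) for a single rule. The receipts and the function name are invented for illustration.

```python
def confidence_and_recall(transactions, antecedent, consequent):
    """Return (P(B|A), P(A|B)) for the rule antecedent -> consequent.

    `transactions` is an iterable of item sets, `antecedent` a set of items
    and `consequent` a single item. Both must occur at least once.
    """
    n_a = n_b = n_ab = 0
    for items in transactions:
        has_a = antecedent <= items   # every antecedent item is present
        has_b = consequent in items
        n_a += has_a
        n_b += has_b
        n_ab += has_a and has_b
    return n_ab / n_a, n_ab / n_b


# Toy receipts (invented for illustration).
receipts = [
    {"windshield wash", "engine oil"},
    {"windshield wash", "engine oil", "coffee"},
    {"windshield wash"},
    {"engine oil"},
    {"engine oil", "coffee"},
    {"coffee"},
]
conf, rec = confidence_and_recall(receipts, {"windshield wash"}, "engine oil")
print(conf, rec)
```

Here two of the three receipts containing windshield wash also contain engine oil (Confidence), while two of the four receipts containing engine oil are covered by the rule (Recall).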
We present in Figure 7 the average Confidence and Recall values obtained on the top-20 rules ranked according to each interestingness measure. The cluster G1, which contains Confidence, scores the highest on this dimension but achieves a very low Recall. G2 is extremely close to G1, but achieves a slightly lower Confidence and Recall. Next, in order of increasing Recall and decreasing Confidence, come G3 and G4. Finally, G5, which contains Recall, achieves the highest value on this dimension while having the lowest Confidence. Figure 7 also shows that a Euclidean distance-based clustering, such as k-means, on the Recall/Confidence coordinates leads to results similar to those of hierarchical clustering. These results are summarized in Table 4.

Rankings comparison for demo assoc
For demo assoc, we adopt exactly the same protocol as for prod assoc. We generate a set of association rules A → B, where B is a product category among a set of 16 representative categories that were selected by our analysts. For each product category B, analysts seek to answer the following question: which customer segments A to target with products in category B. Similarly to prod assoc, our summarization results in 5 clusters (we omit the figure due to space limitations). The first two clusters G1 and G2 remain unchanged. A third cluster G3, which contains 7 measures (including Klosgen and Implication index), is very similar to G1 according to NDCC (due to accounting for high ranks). We obtain a fourth cluster G4 containing 3 measures (Pearson's χ², J-measure and Two-way support variation) and a fifth cluster G5 containing 4 measures (Recall, Collective strength, Cosine, Jaccard). Our hypothesis is that the observed difference between the clusterings obtained for demo assoc and prod assoc is mainly due to the high values of P(A) in demo assoc, unlike in the prod assoc scenario.

Running time and memory consumption
Our development environment consists of Python 3.7.0, which invokes jLCM (implemented in JDK 7) for each target product or category, on a 2.7 GHz Intel Core i7 machine with 16 GB of main memory, running OS X 10.13.6. Table 6 presents the average running time as well as the memory consumption of prod assoc over 228 target products and of demo assoc over 16 categories. We note that demo assoc runs slower than prod assoc. This is mainly due to the difference in cardinalities of the constructed transactional datasets: 35,377,345 transactions in demo assoc and 1,083,901 transactions in prod assoc. We notice a similar trend regarding memory consumption.

Rules that produce the same rankings
We report in this section measures that produce exactly the same rankings. Recall that we are given a set of association rules $R = \{r_1, r_2, ..., r_n\}$ to rank. Given two measures $m_1$ and $m_2$ and their corresponding ranked lists of association rules $L^{m_1}_R$ and $L^{m_2}_R$, $m_1$ and $m_2$ produce exactly the same ranking iff $L^{m_1}_R = L^{m_2}_R$. More formally, for any two rules $r_i$ and $r_j \in R$, if $m_1$ ranks $r_i$ before $r_j$ then $m_2$ also ranks $r_i$ before $r_j$, i.e., in order to prove that two measures $m_1$ and $m_2$ always produce the exact same ranking, one has to prove that:
$$\forall r_i, r_j \in R: \quad r_i^{m_1} < r_j^{m_1} \Rightarrow r_i^{m_2} < r_j^{m_2}$$
where $r^m$ is the rank of the rule $r$ according to measure $m$. These theoretical dependencies between interestingness measures are studied in both C. Tew et al. [29] and Dhouha [5]. Here, we summarize the groups of measures that theoretically produce indistinguishable rankings and give the existing relationships between measures. We do not provide the details of the proofs and kindly refer the reader to C. Tew et al. [29] and Dhouha [5] for the detailed proofs. It is easy to see that Information gain = $\log_2(\text{Lift})$: Information gain is a monotonically increasing transformation of Lift, so the two measures return exactly the same rankings. In addition to these relationships, some others can be found in the special case when the target B is fixed. Since our analyst specifies a single target product at a time, P(B) is constant for all ranking measures. This eliminates some of the differences between the considered interestingness measures. Here, we highlight the measures that give identical rankings when the target is specified.
- {Information gain, Lift, Added Value, Certainty factor, Confidence, Laplace correction}
When the target B is fixed, some dependencies can easily be proven analytically. For example, we can easily notice that:
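One way to see dependencies of this kind is to write several measures as functions of Confidence with $P(B) = c$ held constant. The derivation below is a sketch that assumes the usual textbook definitions of Lift, Added Value and Information gain; it covers only these three measures and is not the full set of proofs given in [29] and [5].

```latex
\begin{align*}
\text{Lift}(A \rightarrow B) &= \frac{P(B \mid A)}{P(B)} = \frac{1}{c}\,\text{Confidence}(A \rightarrow B),\\
\text{Added Value}(A \rightarrow B) &= P(B \mid A) - P(B) = \text{Confidence}(A \rightarrow B) - c,\\
\text{Information gain}(A \rightarrow B) &= \log_2 \text{Lift}(A \rightarrow B) = \log_2 \text{Confidence}(A \rightarrow B) - \log_2 c.
\end{align*}
```

For a fixed $c \in (0, 1]$, each right-hand side is strictly increasing in Confidence, so under these definitions the three measures order the rules of a given target exactly as Confidence does.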

User study
We now report the results of a user study with domain experts at TOTAL. The goal of this study is to assess the ability of interestingness measures to rank association rules according to the needs of an analyst. More specifically, we would like to identify which of the interestingness measures are most preferred by our analysts. As explained in Section 4, we identified 5 clusters of similar measures, and selected a representative measure in each cluster for the user study (their names are in bold in Table 3). Representative measures are selected as the ones that best represent each cluster of measures (i.e., those with the highest average similarity).
We rely on the expertise of our industrial partner to determine, for each analysis scenario, which family produces the most actionable results. Actionable is interpreted as the most likely to lead to relevant recommendations. This experiment involved 2 experienced analysts: one data scientist and one product manager (co-authors of this paper).
For each mining scenario, prod assoc and demo assoc, we sampled target products and target categories respectively. Each analyst picks a mining scenario among prod assoc and demo assoc for which a target product or a target category must be chosen, respectively. The analyst receives a ranked list of rules. Neither the name of the measure nor its computed values for association rules are revealed because we wanted analysts to evaluate rankings without knowing how they were produced.
For a given scenario and a target product or category, our analysts completed 20 comparative evaluations showing two rankings to be compared with the top-10 rules per ranking. In each case, analysts were asked a global question on which ranking they preferred, and also to mark actionable rules in each ranking. We also collected feedback in a free-text form.

Initial study
In our initial deployment, only a few rules were marked as actionable and most rules were deemed unsurprising regardless of their ranking. After a careful examination of the rankings and of the free-text comments, we found that most rules contained products that had been bundled together as promotional offers, and that most rule antecedents in prod assoc were "polluted" with frequently purchased items.
For instance, gas and plastic bags are present in many rules and only confirm what analysts already know: that most customers purchase gas and plastic bags for their groceries. Similarly, in summer, TOTAL regularly runs offers for multipurpose wipes and for car washing services. Other offers are more subtle and formulated as "2 products among": Evian, Coca-Cola, Red Bull, Lay's Chips, Haribo, Mars, Snickers, Twix, Bounty and Granola. It is hence unsurprising to find rules associating any two of those items.
As a result, we decided to filter out gas and plastic bags from the dataset and to remove from transactions items purchased shortly after a promotional offer (identified by their reduced total price).

Feedback on ranking measures
Our second deployment was more conclusive. In summary, we observed that rankings that favor Confidence are best to determine which products to promote together, and rankings that favor Recall are well-suited for the case where a product is given and the goal is to find who to target. These conclusions resulted from deploying comparative evaluations for 5 products for prod assoc and 5 categories for demo assoc.
In the case of prod assoc, the most preferred cluster was G1, and an overwhelming proportion of rules in that cluster were marked as actionable. The next most preferred in this same scenario is G2. Both G1 and G2 favor Confidence, i.e., P(B|A), and reflect the case where a product is given and the goal is to find which other products A to bundle it with in a promotion.
We summarize the feedback we received.
1. Associations between Coffee/Coke and other products: Coffee has a high confidence with Chocolate bars and other drinks (Water, Energy drinks and Soda). This association was deemed immediately actionable. A similar observation can be made with the association between Coca-Cola and Sandwiches, Drinks, Potato chips and Desserts.
2. Associations between car-related products: The product Engine Oil has a high confidence with the car wash service TOTAL Wash, Windshield wash and a product for engine maintenance. This association was deemed immediately actionable. A similar observation was made for Tire Spray and TOTAL Wash, Windshield wash and different car wipes products.
3. Associations between products in different categories: The product Bounty chocolate bar has a high confidence with products in the same category (other Chocolate bars), but also with different Biscuits, Coffee and drinks (Water and Soda). This association was deemed broad in scope and immediately actionable.
4. Associations between products in the same category: It was observed that the product Petit Ecolier, a chocolate biscuit, had a high confidence with other biscuits. According to our analysts, running offers on competing products is risky from a marketing point of view.
These examples illustrate the overwhelming preference for measures favoring Confidence for prod assoc rules, and the need for domain experts in the loop to assess the actionability of rules, beyond automatic measures.
In the case of demo assoc, the most preferred cluster was G5, and an overwhelming proportion of rules in that cluster were marked as actionable. The next most preferred in this same scenario is G4. Both G4 and G5 favor Recall, i.e., P(A|B), and reflect the case where a product category is given and the goal is to find who to target. In the case of demo assoc, who to target is directly interpreted as which customer segments to target with products in that category. We summarize the feedback we received.
1. Ice cream products are mostly consumed in the region around Paris, in the South of France and in stations on the highway from North to South. That is the case for all consumer segments across all ages and genders. This rule led our analysts to look more carefully into the kind of station at which Ice cream products are consumed (e.g., on highways or not).
2. Hot drinks are less attractive in the South of France.
3. Car lubricants are mostly purchased by seniors (regardless of gender and location).
The above examples illustrate the overwhelming preference for measures offering a high Recall as ranking measures for demo assoc rules, and the interest of domain experts in finding the best customer segments to target with products in a specific category.

Product Recommendation
Recommendation systems are designed to guide users in a personalized way in finding useful items among a large number of possible options. Nowadays, recommendations are deployed in a wide variety of applications, such as e-commerce, online music, movies, etc. Like many retailers, TOTAL expressed a need for an automatic recommendation system to increase customer satisfaction and keep them away from competitor retailers. The deployment chosen by our business partners is to first design and evaluate recommendations using the synthesized interestingness measures, and then choose the right one for an actual deployment campaign in gas stations and for running personalized promotional offers.

Recommendation through Association Rules
Recommendation systems can benefit from association rule extraction [11,26]. As shown in the experimental study by Pradel et al. [24], association rules have demonstrated good performance in recommendations using real-world e-commerce datasets, where explicit feedback such as ratings on the products is not available. Thus, it appears necessary to evaluate the performance of recommendations based on association rule mining on our dataset.
Association rules were first used to develop top-N recommendations by Sarwar et al. [26]. They use support and confidence to measure the strength of a rule. First, for each customer, they build a single transaction containing all products that were ever purchased by that customer. Then, they use association rule mining to retrieve all the rules satisfying given minimum support and minimum confidence constraints. To perform top-N recommendations for a customer u, they find all the rules that are supported by the customer's purchase history (i.e., the customer has purchased all the products in the antecedent of the rule). Then, they sort the products that the customer has not purchased yet based on the maximum confidence of the association rules that were used to predict them. The N highest ranked products are kept as the recommended list. The authors of [11] use a very similar approach, but they also consider additional association rules between higher-level categories, where it is assumed that products are organized into a hierarchical structure.
These works present two main drawbacks. First, they require specifying thresholds on support and confidence, which might be hard to adapt to different customers and which results in the inability to recommend products that are not very frequent. Second, searching for rules where the whole purchase history of a customer is included in the antecedent might lead to a very low or insufficient number of associations. Thus, for every customer who purchased a single product that fails the minimum support constraint, the approach cannot compute recommendations. To overcome these drawbacks, we adapt the approach of Pradel et al. [24], which uses bi-gram association rules and consists in computing the relevance of the association rule (l → k) for every pair of products l and k. In computing the relevance of a rule, we do not restrict ourselves to confidence, and we leverage the results we obtained on the synthesized measures to compare how different interestingness measures behave in practice (i.e., in providing accurate recommendations). In fact, as we show in Table 7, for the same anonymized customer, different interestingness measures (in this case: Confidence, Least-Contradiction and Piatetsky-Shapiro) provide different top-5 product recommendations.
More formally, let $U = \{u_1, u_2, ..., u_m\}$ be the set of all customers and $I = \{i_1, i_2, ..., i_n\}$ be the set of all products. For a given customer u, $H_u \subseteq I$ denotes the purchase history of u, i.e., the set of all products ever purchased by u. The training stage of the algorithms we evaluate takes as input a purchase matrix, where each column corresponds to a product and the customers that have purchased it, and each row represents a customer and the products she purchased.
We denote by $P$ the purchase matrix of the m customers in U over the n products in I. An entry $p_{u,i}$ in the matrix contains a boolean value (0 or 1), where $p_{u,i} = 1$ means that product i was bought by customer u at least once (0 means the opposite).
We leverage the purchase history of our customers to extract association rules of the form i ⇒ j, which means that whenever a customer purchases the product i (antecedent), she is likely to purchase the product j (consequent). Therefore, we use bi-gram rules to compute an association matrix $A$ between each pair of products i, j. The matrix $A$ is computed from the purchase matrix $P$, where each entry $a_{i,j}$ corresponds to the interestingness of the association rule i ⇒ j.
The training phase of the approach consists of computing all available bi-gram association rules and storing the corresponding interestingness values in the association matrix $A$ of size n × n. Once the association matrix is computed, to generate top-N recommendations for customer u, we first identify the set of association rules that are supported by the purchase history of u, i.e., rules of the form i ⇒ j, where i is purchased by u. Then, non-purchased products are ranked either by the maximum value [26,24] or by the sum of values [11] of all association rules that predict them. In our case, the max aggregation was found to give slightly better results. This could be explained as follows. Consider a target product purchased by a test customer in the test set (e.g., cleaning wipes), and suppose that in the past the customer purchased food and drink products frequently and car wash products (e.g., windshield washer) less frequently. Using the sum aggregation will result in a poor prediction for the target cleaning wipes, even if it is highly associated with the windshield washer, because of its weak associations with the food and drink products that the customer purchased.
Thus, we compute the score of a product j for a customer u as follows:
$$score(u, j) = \max_{i \in H_u} a_{i,j}$$
where $H_u$ is the purchase history of customer u, and j is a candidate product for recommendation. Products are then sorted according to their respective scores and the top-N products are recommended to u.
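The training and scoring steps described above can be sketched as follows. This is a minimal illustration that uses Confidence as the interestingness measure and the max aggregation; the function names and the toy purchase histories are hypothetical, and any other measure could be substituted when filling the association matrix.

```python
from collections import defaultdict


def confidence_matrix(histories):
    """Bi-gram association scores assoc[i][j] = confidence of i => j.

    `histories` maps each customer to the set of products ever purchased.
    Confidence is only one possible choice of interestingness measure.
    """
    count = defaultdict(int)       # customers who bought i
    pair_count = defaultdict(int)  # customers who bought both i and j
    for products in histories.values():
        for i in products:
            count[i] += 1
            for j in products:
                if i != j:
                    pair_count[i, j] += 1
    assoc = defaultdict(dict)
    for (i, j), n_ij in pair_count.items():
        assoc[i][j] = n_ij / count[i]
    return assoc


def recommend(history, assoc, n):
    """Top-n unpurchased products, scored by the max over purchased antecedents."""
    scores = {}
    for i in history:
        for j, value in assoc.get(i, {}).items():
            if j not in history:
                scores[j] = max(scores[j], value) if j in scores else value
    # Sort by decreasing score; ties are broken by product name for determinism.
    return sorted(scores, key=lambda j: (-scores[j], j))[:n]


# Toy purchase histories (invented for illustration).
training = {
    "u1": {"coffee", "chocolate bar"},
    "u2": {"coffee", "chocolate bar", "water"},
    "u3": {"engine oil", "windshield wash"},
    "u4": {"engine oil", "windshield wash", "coffee"},
}
assoc = confidence_matrix(training)
top = recommend({"engine oil"}, assoc, n=2)
print(top)
```

Replacing `max` with a sum in `recommend` gives the alternative aggregation discussed above, which makes it easy to compare the two on the same association matrix.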

Experiments
Protocol In this section, we present our experimental protocol and the evaluation measures we use in our experiments. A widely used strategy for evaluating recommendation accuracy in offline settings is to split the dataset into training and test sets. The test set is used to simulate future transactions (ratings, clicks, purchases, etc.) and it usually contains a fraction of the transactions. The remaining transactions are kept in the training set and are fed to the recommendation algorithm to output a list of top-N product recommendations for each user. The accuracy of recommendations is then evaluated on the test set. However, this setting does not reflect reality well in the retail context, as it is time-agnostic.
The availability of timestamps in the purchase records enables us to attempt a more realistic experiment. We hence train our algorithm on past purchases and test the results on future purchases. We split the dataset according to a given point in time which acts as our "present" (the time we apply our algorithm). Purchases that happened before the split point are used for training, whereas future purchases after the split point are used for testing. Customers whose purchase histories are timestamped only after the split point are discarded. For our dataset, we choose 1st January 2019 to be the split date. More specifically, we use purchase records from January 2017 to December 2018 for training and records from January 2019 to December 2019 for testing.
As is common practice in the recommendation literature [8,25], we discard from our experiments customers who purchased fewer than 5 products in the training set. An important aspect of our dataset, and of all datasets in the retail domain, is the tendency of customers to repeatedly purchase the same products at different times. It is however much more valuable for the customer, and even for the retailer, to recommend products that the customer has not purchased recently or is not aware of. In addition, we noticed that if we simply randomly select N products from the purchase history of each customer as the top-N recommendations, we can reach reasonable accuracy. Thus, after several exchanges with the marketing department at TOTAL, we decided to remove from the test set of each test customer the "easy" predictions, i.e., the products that were purchased by that customer during the training period. We also select only customers who had more than 10 purchases in the test set after removing already purchased products. This setting makes the task of predicting the correct products harder but potentially more impactful in a real-world scenario.
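The protocol above can be sketched as a time-based split followed by two filters. The record layout, the function names and the toy data are hypothetical, and the sketch assumes that both thresholds count distinct products, which is one possible reading of the text.

```python
from datetime import date


def temporal_split(purchases, split_date):
    """Split (customer, product, day) records into per-customer product sets."""
    train, test = {}, {}
    for customer, product, day in purchases:
        side = train if day < split_date else test
        side.setdefault(customer, set()).add(product)
    return train, test


def build_targets(train, test, min_train=5, min_test=10):
    """Target sets: test products that are new to the customer.

    Customers absent from `train` (first seen after the split) are skipped.
    Thresholds count distinct products, which is an assumption of this sketch.
    """
    targets = {}
    for customer, history in train.items():
        if len(history) < min_train:
            continue
        new_products = test.get(customer, set()) - history
        if len(new_products) > min_test:
            targets[customer] = new_products
    return targets


# Toy records (invented); small thresholds keep the example short.
purchases = [
    ("u1", "coffee", date(2018, 5, 1)),
    ("u1", "water", date(2018, 6, 1)),
    ("u1", "coffee", date(2019, 2, 1)),   # repeat purchase: removed from the targets
    ("u1", "wipes", date(2019, 3, 1)),
    ("u1", "soda", date(2019, 3, 2)),
    ("u2", "gas", date(2019, 4, 1)),      # only seen after the split: discarded
]
train, test = temporal_split(purchases, split_date=date(2019, 1, 1))
targets = build_targets(train, test, min_train=2, min_test=1)
print(targets)
```

With the default thresholds (`min_train=5`, `min_test=10`) and a split date of 1st January 2019, the same two functions follow the setting described in the text.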
Evaluation Measures A recommendation algorithm outputs a sorted list of top-N product recommendations given the purchase history of a target customer. Top-N recommendations are typically evaluated in terms of their precision, recall and F1-score [6,7]. For each customer u, precision measures the percentage of recommended products that are relevant, recall measures the percentage of relevant products that are recommended, and the F1-score is defined as the harmonic mean of precision and recall. In our setting, a product i is relevant to a customer u if u has effectively purchased i in the test set.
In our approach, we have a set of test customers with a corresponding target set of products (recall that the target set contains the customer purchases in the test data). For a given customer u, the precision, recall and F1-score of the top-N recommendations are respectively defined as follows:
$$Precision_u@N = \frac{|R_u@N \cap T_u|}{|R_u@N|}, \qquad Recall_u@N = \frac{|R_u@N \cap T_u|}{|T_u|}, \qquad F1_u@N = \frac{2 \cdot Precision_u@N \cdot Recall_u@N}{Precision_u@N + Recall_u@N}$$
where, given a customer u, $T_u$ is the target test set and $R_u@N$ is the set of top-N recommendations.
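These three measures can be computed per customer as in the sketch below, which assumes the standard set-based definitions of precision, recall and F1. The function name and the example lists are hypothetical.

```python
def precision_recall_f1_at_n(recommended, target, n):
    """Precision@N, Recall@N and F1@N for one customer.

    `recommended` is the ranked recommendation list and `target` is the
    non-empty set of products the customer bought in the test period.
    """
    top_n = recommended[:n]
    hits = len(set(top_n) & set(target))
    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(target)
    # The harmonic mean is undefined when both terms are zero; report 0 then.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Two of the four recommendations are in the target set of three products.
p, r, f1 = precision_recall_f1_at_n(["a", "b", "c", "d"], {"b", "d", "e"}, n=4)
print(p, r, f1)
```

Averaging each value over all test customers gives one figure per measure for a given interestingness measure.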

Results
In our experiments, each row of our training purchase matrix contains all known purchases of a training customer before the split date at which training and test sets are separated: January 1st, 2019. All algorithms using the different selected interestingness measures are evaluated using exactly the same test customers and the corresponding target sets. The reported performance results are computed following the experimental protocol and the evaluation measures described in Section 6.2.
The values of recommendation accuracy (Precision@10, Recall@10 and F1@10) for each interestingness measure are reported in Table 8. First, we notice that G1 achieves the best recommendation performance and performs slightly better than G2. The performance results confirm our findings in the user study, where our domain experts preferred group G1 and measures that favor confidence for the prod assoc scenario. Second, we notice that the recommendation accuracies achieved by groups G1 and G2 are very close (12.56% and 12.08% for Precision@10, respectively). The same holds for groups G3 and G4, whose performances are very similar (10.95% and 10.56% for Precision@10, respectively). These results are consistent with the clustering that we performed using NDCC (Figure 4b). Since we compute top-10 lists per customer, NDCC gives more importance to association rules at the top of the lists. This explains the similar recommendation performance of G1 and G2, as well as of G3 and G4: the average distance between G1 and G2 in the dendrogram in Figure 4b is 0.15, and the average distance between G3 and G4 is 0.17. We also notice very poor performance for measures in group G5, which are not usable in practice. This is mainly due to the fact that measures having a very low confidence favor rare targets over frequent ones for ranking association rules, which results in recommending mostly irrelevant products. We also noticed that some measures that are in the same cluster may not produce similar recommendation performance. In fact, we also produced recommendations using several measures within the same cluster and found that the recommendations were different in some cases. We conjecture that this is because the rules used for computing recommendations focus on different subsets of purchased products for different users, and thus exhibit the same phenomenon as Simpson's paradox.
For instance, using Accuracy (G2) as a ranking measure leads to poor performance (0.91% for Precision@10), while using Least Contradiction (G2) gives much better results (12.08% for Precision@10), even though both measures are in the same cluster.
Finally, we implemented a MostPop baseline, which is the method that was used so far by our analysts. It is a non-personalized method that recommends to each customer the most popular products that the customer has not purchased yet. We can see from our results that, except for measures in group G5, all other groups of measures perform better than the non-personalized baseline. In particular, G1 shows an improvement of 53.54% in relative performance.
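A baseline of this kind can be sketched as follows. The sketch assumes that popularity is measured by the number of distinct customers who bought a product, which is one plausible definition; the function names and the toy histories are hypothetical.

```python
from collections import Counter


def most_popular(histories):
    """Products sorted by the number of distinct customers who bought them."""
    counts = Counter(p for products in histories.values() for p in products)
    # Ties are broken by product name so that the order is deterministic.
    return sorted(counts, key=lambda p: (-counts[p], p))


def most_pop_recommend(history, popularity, n):
    """The n most popular products that the customer has not purchased yet."""
    return [p for p in popularity if p not in history][:n]


# Toy purchase histories (invented for illustration).
histories = {
    "u1": {"coffee", "water"},
    "u2": {"coffee", "soda"},
    "u3": {"coffee", "water", "wipes"},
}
popularity = most_popular(histories)
baseline = most_pop_recommend({"coffee"}, popularity, n=2)
print(baseline)
```

Apart from the removal of already purchased products, every customer receives the same list, which is what makes this baseline non-personalized.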

Related work
To the best of our knowledge, this paper is the first to bring a framework for association rule mining to the marketing department of an oil and gas company, and empower domain experts with the ability to conduct large-scale studies of customer purchasing habits.
The definition of quality of association rules is a well-studied topic in statistics and data mining. In their survey [4], Geng et al. review 38 measures for association and classification rules. They also discuss 4 sets of properties like symmetry or monotony, and how each of them highlights different meanings of "rule quality", such as novelty and generality. However, we observe no correlation between these properties and the groups of measures discovered using our framework.
These 38 measures are compared in [14]. Authors consider the case of extracting and ranking temporal rules (event A→event B ) from the execution traces of programs. Each measure is evaluated in its ability to rank highly rules known from a ground truth (library specification). We observe that the measures scoring the highest are all from the groups identified in this work as G 1 and G 2 , which are also favored by our analysts. There are however some counterexamples, with measures from G 1 scoring poorly. The main difference between our work and [14] is the absence of a ground truth of interesting rules for our dataset.
Closely related to our work is Herbs [15]. Herbs relies on a different and smaller set of measures to cluster rule rankings. The authors perform an analysis of the properties of measures, in addition to an experimental study. The datasets used are from the health and astronomy domains. Each of them contains at most 1,728 transactions and leads to the extraction of 49 to 6,312 rules. Rankings are then compared between all pairs of measures using Kendall's τ correlation measure averaged over all datasets. The largest group of measures identified, which includes Confidence, is similar to G1.
Our use of the p-value (via Pearson's χ 2 test) in the evaluation of rule interestingness is borrowed from [17]. In that work, the authors propose an exploration framework where rules are grouped by consequent and traversed by progressively adding items to the antecedent. The framework provides hints incrementally to help guess how each additional item would make a difference. Such a framework is suitable to some of the scenarios we consider and could be integrated in a future version of our work.
Other significant works on clustering interestingness measures include [5,2,29]. In these studies, 61 measures are analyzed from both a theoretical and an empirical perspective to provide insights about the properties and behavior of the measures with respect to association rule ranking. The number of measures studied in these works is greater than ours. However, our work goes a step further as (1) we provide a user study performed with domain experts from the TOTAL marketing department, (2) we show how association rules can be used to perform top-N recommendations, and (3) we show a comparative evaluation of the synthesized interestingness measures according to accuracy measures.
An interesting research area is OLAP pattern mining, which integrates online analytical processing (OLAP) with data mining so that the mining can be performed in different portions of the database [18,23]. However, the focus of our work is neither on expressivity nor on computational performance. An interesting research direction would be to extend our framework to use the full power of OLAP.

Conclusion
We present our framework to enable decision support through mining, ranking, and summarization of association rules. We use a large longitudinal TOTAL dataset comprising 30 million unique sales receipts, spanning 35 million records. In conjunction with domain experts who are not data scientists, we studied two scenarios: associations between a set of products and a target product, and between customer segments and product categories. Both of these scenarios led to actionable insights and to effective decision support for the TOTAL marketers. We empirically studied 35 interestingness measures for ranking association rules and further summarized them into 5 synthesized clusters or groups. The resulting groups were then evaluated in a user study involving a data scientist and a domain expert at TOTAL. We concluded that ranking measures ensuring high confidence best fit the needs of analysts in the case of prod assoc, and that measures ensuring high recall are better in the case of demo assoc. Finally, we discussed how our findings can be used to perform product recommendation using different interestingness measures for ranking association rules.