interestMeasure {arules}    R Documentation

Calculating various additional interest measures

Description

Provides the generic function interestMeasure and the needed S4 method to calculate various additional interest measures for existing sets of itemsets or rules.

Usage

interestMeasure(x, method, transactions = NULL, ...)

Arguments

x a set of itemsets or rules.
method name of the interest measure (see details for available measures).
transactions the transaction data set used to mine the associations.
... further arguments for the measure calculation.

Details

For itemsets the following measures are implemented:

"allConfidence"
(see Omiecinski, 2003) is defined on itemsets as the minimum confidence of all possible rules generated from the itemset.
"crossSupportRatio"
(see Xiong et al., 2003) is defined on itemsets as the ratio of the support of the least frequent item to the support of the most frequent item. Cross-support patterns have a ratio smaller than a set threshold. Typically, many of the patterns found are cross-support patterns, which contain frequent as well as rare items. Such patterns often tend to be spurious.
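The two itemset measures above can be illustrated with a minimal base R sketch. The support values are made up for illustration, and the variable names are hypothetical; this is not the package's implementation.

```r
# Hypothetical supports for the itemset {a, b, c}; the numbers are made up.
item_support    <- c(a = 0.60, b = 0.10, c = 0.40)
itemset_support <- 0.05

# The confidence of a rule generated from the itemset is
# supp(itemset) / supp(lhs). It is smallest when the lhs is the single
# most frequent item, which gives the all-confidence value.
all_confidence <- itemset_support / max(item_support)

# Ratio of the support of the least frequent item to the support of the
# most frequent item.
cross_support_ratio <- min(item_support) / max(item_support)

print(c(allConfidence = all_confidence,
        crossSupportRatio = cross_support_ratio))
```

A small cross-support ratio, as in this example, signals an itemset that mixes a rare item with a frequent one.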

For rules the following measures are implemented:

"chiSquare"
(see Liu et al. 1999). The chi-square statistic to test for independence between the lhs and rhs of the rule. The critical value of the chi-square distribution with 1 degree of freedom (2x2 contingency table) at alpha=0.05 is 3.84; higher chi-square values indicate that the lhs and the rhs are not independent.
"cosine"
(see Tan et al. 2004) equivalent to the IS measure. Range: 0...1.
"conviction"
(see Brin et al. 1997) defined as P(X)P(not Y)/P(X and not Y). Range: 0...1... Inf (1 indicates unrelated items).
"gini"
gini index (see Tan et al. 2004). Range: 0...1.
"hyperLift"
(see, Hahsler et al., 2005) is an adaptation of the lift measure which is more robust for low counts. It is based on the idea that under independence the count c_{XY} of the transactions which contain all items in a rule X -> Y follows a hypergeometric distribution (represented by the random variable C_{XY}) with the parameters given by the counts c_X and c_Y.

Lift is defined for the rule X -> Y as:

lift(X -> Y) = P(X and Y) / (P(X) P(Y)) = c_{XY} / E[C_{XY}],

where E[C_{XY}] = c_X c_Y / m with m being the number of transactions in the database.

Hyper-lift is defined as:

hyperlift(X -> Y) = c_{XY} / Q_d[C_{XY}],

where Q_d[C_{XY}] is the d-quantile of the hypergeometric distribution. The quantile level can be set with the parameter d (default: d=0.99). Range: 0... Inf.
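As a rough illustration of the two definitions, the following base R sketch computes lift and hyper-lift from counts. The counts are made up, the variable names are hypothetical (n_trans plays the role of m), and the mapping of c_X and c_Y onto the arguments of qhyper is an assumption about how the hypergeometric model is parameterized, not a copy of the package code.

```r
# Hypothetical counts; the numbers are made up for illustration.
n_trans <- 1000   # m: number of transactions in the database
c_X     <- 100    # transactions containing X
c_Y     <- 200    # transactions containing Y
c_XY    <- 40     # transactions containing both X and Y
d       <- 0.99   # quantile level (the documented default)

# Lift: observed count divided by the count expected under independence.
expected_count <- c_X * c_Y / n_trans
lift <- c_XY / expected_count

# Hyper-lift: observed count divided by the d-quantile of the
# hypergeometric distribution (c_Y "successes" among n_trans
# transactions, c_X transactions drawn).
Q_d <- qhyper(d, m = c_Y, n = n_trans - c_Y, k = c_X)
hyper_lift <- c_XY / Q_d

print(c(lift = lift, hyperLift = hyper_lift))
```

Because the quantile for a high d lies above the expected count, hyper-lift is smaller than lift here, which reflects its more conservative treatment of low counts.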

"hyperConfidence"
(based on Hahsler et al., 2005) calculates the confidence level that the observed count for a rule X -> Y is too high or too low under the hypergeometric model. Since the counts are drawn from a hypergeometric distribution (represented by the random variable C_{XY}) with known parameters given by the counts c_X and c_Y, we can calculate a confidence interval for the observed count c_{XY} stemming from the distribution. Hyper-confidence reports the confidence level (the significance level if significance=TRUE is used) for

complements: 1 - P[C_{XY} >= c_{XY} | c_X, c_Y]
substitutes: 1 - P[C_{XY} < c_{XY} | c_X, c_Y]

A confidence level of, e.g., > 0.95 indicates that there is only a 5% chance that the count for the rule was generated randomly.

By default complementary effects are mined; substitutes can be found by using the parameter complements = FALSE. Range: 0...1.
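The two probabilities can be sketched in base R with phyper. The counts are made up, the variable names are hypothetical, and the code follows the formulas exactly as written above rather than the package's internal implementation.

```r
# Hypothetical counts; the numbers are made up for illustration.
n_trans <- 1000   # number of transactions
c_X     <- 100    # transactions containing X
c_Y     <- 200    # transactions containing Y
c_XY    <- 40     # transactions containing both X and Y

# P[C_XY < c_XY] = P[C_XY <= c_XY - 1] under the hypergeometric model.
p_below <- phyper(c_XY - 1, m = c_Y, n = n_trans - c_Y, k = c_X)

# Complements: 1 - P[C_XY >= c_XY], which equals P[C_XY < c_XY].
hyper_conf_complements <- p_below

# Substitutes: 1 - P[C_XY < c_XY].
hyper_conf_substitutes <- 1 - p_below

print(c(complements = hyper_conf_complements,
        substitutes = hyper_conf_substitutes))
```

With an observed count of 40 against an expected count of 20, the complements value is close to 1, so the rule would pass a 0.95 confidence threshold.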

"improvement"
(see Bayardo et al. 2000) the improvement of a rule is the minimum difference between its confidence and the confidence of any proper sub-rule with the same consequent. Range: 0...1.
"leverage"
(see Piatetsky-Shapiro 1991) defined as P(X and Y) - P(X)P(Y). It measures the difference between how often X and Y appear together in the data set and what would be expected if X and Y were statistically independent. Range: -1...1 (0 indicates independence).
"phi"
the correlation coefficient phi (see Tan et al. 2004). Range: -1 (perfect neg. correlation) to +1 (perfect pos. correlation).
"oddsRatio"
(see Tan et al. 2004). The odds of finding X in transactions which contain Y divided by the odds of finding X in transactions which do not contain Y. Range: 0...1... Inf (1 indicates that Y is not associated with X).
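Several of the rule measures above depend only on the 2x2 contingency table of X and Y. The following base R sketch computes them from made-up counts. All variable names are hypothetical, the cosine and phi formulas are the standard textbook definitions rather than text from this page, and the use of the uncorrected Pearson statistic for chiSquare is an assumption.

```r
# Hypothetical counts; the numbers are made up for illustration.
n    <- 1000   # number of transactions
c_X  <- 100    # transactions containing X
c_Y  <- 200    # transactions containing Y
c_XY <- 40     # transactions containing both X and Y

# 2x2 contingency table: rows X / not X, columns Y / not Y.
tab <- matrix(c(c_XY,       c_X - c_XY,
                c_Y - c_XY, n - c_X - c_Y + c_XY),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("X", "notX"), c("Y", "notY")))

p_X  <- c_X / n
p_Y  <- c_Y / n
p_XY <- c_XY / n

# Assumption: Pearson chi-square without continuity correction.
chi_square <- as.numeric(chisq.test(tab, correct = FALSE)$statistic)

cosine     <- p_XY / sqrt(p_X * p_Y)
conviction <- p_X * (1 - p_Y) / (p_X - p_XY)   # P(X)P(not Y)/P(X and not Y)
leverage   <- p_XY - p_X * p_Y
phi        <- leverage / sqrt(p_X * p_Y * (1 - p_X) * (1 - p_Y))
odds_ratio <- (tab["X", "Y"] * tab["notX", "notY"]) /
              (tab["X", "notY"] * tab["notX", "Y"])

print(round(c(chiSquare = chi_square, cosine = cosine,
              conviction = conviction, leverage = leverage,
              phi = phi, oddsRatio = odds_ratio), 4))
```

For a 2x2 table the uncorrected chi-square statistic equals n times phi squared, which gives a quick consistency check between the two measures.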

Note that calculating the interest measures requires support (and, for rules, also confidence and lift) to be stored in the quality slot of x. These measures are returned by the mining algorithms implemented in this package. Note also that the calculation of some measures is quite slow since we do not have access to the original itemset structure which was used for mining.

Value

A numeric vector containing the values of the interest measure for each association in the set of associations x.

References

R. Bayardo, R. Agrawal, and D. Gunopulos (2000). Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3):217–240.

Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur (1997). Dynamic itemset counting and implication rules for market basket data. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pages 255–264, Tucson, Arizona, USA.

Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2005). Implications of probabilistic data modeling for rule mining. Report 14, Research Report Series, Department of Statistics and Mathematics, Wirtschaftsuniversitaet Wien, Augasse 2-6, 1090 Wien, Austria.

Bing Liu, Wynne Hsu, and Yiming Ma (1999). Pruning and summarizing the discovered associations. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 125–134. ACM Press.

Edward R. Omiecinski (2003). Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering, 15(1):57–69.

Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava (2004). Selecting the right objective measure for association analysis. Information Systems, 29(4):293–313.

Gregory Piatetsky-Shapiro (1991). Discovery, analysis, and presentation of strong rules. In Knowledge Discovery in Databases, pages 229–248.

Hui Xiong, Pang-Ning Tan, and Vipin Kumar (2003). Mining strong affinity association patterns in data sets with skewed support distribution. In Bart Goethals and Mohammed J. Zaki, editors, Proceedings of the IEEE International Conference on Data Mining, November 19–22, 2003, Melbourne, Florida, pages 387–394.

See Also

itemsets-class, rules-class

Examples

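data("Income")
rules <- apriori(Income)

## add hyper-confidence to the quality measures of the rules
quality(rules) <- cbind(quality(rules),
        hyperConfidence = interestMeasure(rules, method = "hyperConfidence",
        transactions = Income))

## inspect the first rules after sorting by hyper-confidence
inspect(head(SORT(rules, by = "hyperConfidence")))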

[Package arules version 0.6-3 Index]