gcl {gcl} | R Documentation |
gcl is an R function that computes a fuzzy rules classifier given numeric input data as the data frame or matrix mydata. gcl returns an R function that implements the computed classifier.
classifier <- gcl(mydata, nlev=3, filter=1.2, multi=NULL, gcl.verbose=F, ...)
classifier <- sgcl(mydata, cb=gcl, s.fold=4, s.verbose=FALSE, s.eval=acc.eval, ...)
classifier <- tcl(mydata, t.nlev = 3, g = gainr, inf.lim = 0.5, ...)
mydata |
The input data frame or matrix must have column names. The last column is taken to contain the class labels. All entries but the entries in the last column must be numerical. |
nlev |
=<integer larger than 1> Default value: 3 Sets how many fuzzy sets the values in each column will be represented by. The fuzzy sets have triangular shape and are determined by three numbers: the first 0 crossing, the 1 crossing, and the last 0 crossing. Memberships before the first and after the last are 0. |
filter |
=<NULL, matrix or data frame, index vector, or positive real number> Default value: 1.2 Determines what data to use for empirical filtering of the rules following the rule generation stage. The objective of the filtering is to remove redundant rules. The data used for this is determined according to the following rules: If filter is NULL, no filtering is done. If filter is a matrix or data frame, this will be used. If filter is an index vector (boolean or integer), the rows in the data indexed by the index vector are used for filtering. If filter is a positive real number, a subset of the data set will be sampled from the data supplied such that each row has a probability equal to 1 minus the fractional filter value, i.e., 1 - (filter - floor(filter)), to be used for construction of the rules. For the default value 1.2 this probability is 0.8. If filter < 1, then the data not used for rule computation will be used for rule filtering, i.e., to compute redundant rules and remove these. If filter >= 1, then all the data will be used for filtering. |
multi |
=<NULL or positive integer> Default value: NULL If multi is NULL, rules are created from the entire input data set. If multi is not NULL, the input data is partitioned into multi equally sized sets. Rules are created from each of the multi possible unions of (multi - 1) of these sets. The concatenation of the resulting lists of rules is taken as the output of the rule generation stage. |
gcl.verbose |
=<TRUE or FALSE> Default value: TRUE Make gcl output a little info while running. |
cb |
=<classifier builder function> Default value: gcl Which classifier builder to use. |
s.fold |
=<positive integer> Default value: 4 How many-fold the cross validation is to be in sgcl. |
s.eval |
=<function(classifier function, data) returning a numeric matrix> Default value: acc.eval (computing accuracy) The evaluator used by sgcl. |
s.verbose |
=<TRUE or FALSE> Default value: FALSE Make sgcl output a little info while running. |
t.nlev |
=<integer larger than 1, or 0> Default value: 3 Sets how many fuzzy sets the values in each column will be represented by. The fuzzy sets have triangular shape and are determined by three numbers: the first 0 crossing, the 1 crossing, and the last 0 crossing. Memberships before the first and after the last are 0. Can be set to 0 in order to build a non-fuzzy classification tree. |
g |
=<function taking two vectors of equal length returning a number> Default value: gainr The splitting function used by tcl. Two implemented choices are gain and gainr. gain is the information theoretic function of the same name, gainr is the gain ratio function. |
inf.lim |
=<non-negative real number> Default value: 0.5 If the information content in the outcome attribute is less than this limit for the current partition class under consideration, tcl will not split further. |
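The sampling rule for a numeric filter value can be illustrated with a short sketch. The helper split_for_filter below is hypothetical and not part of the gcl package; it only mirrors the rule as worded in the filter description, and the package's own sampling code may well differ.

```r
# Hypothetical helper (not from the gcl package): decide which rows are used
# for rule construction and which for rule filtering, given a numeric filter.
split_for_filter <- function(n, filter) {
  stopifnot(is.numeric(filter), length(filter) == 1, filter > 0)
  frac <- filter - floor(filter)          # fractional part of filter
  build <- runif(n) < (1 - frac)          # each row kept with probability 1 - frac
  # filter < 1: filter on the rows not used for construction;
  # filter >= 1: filter on all rows
  filt <- if (filter < 1) !build else rep(TRUE, n)
  list(build = build, filter = filt)
}

set.seed(42)
parts <- split_for_filter(150, 1.2)
mean(parts$build)    # fraction of rows sampled for rule construction
all(parts$filter)    # with filter >= 1 every row is used for filtering
```

With a value below 1, for example 0.3, the two index vectors returned by this sketch are complementary, so no row is used both for building and for filtering rules.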
gcl
This function computes a fuzzy rules classifier given numeric input data as the data frame or matrix mydata.
The algorithm for doing so is described in Vinterbo et al., 2005.
When applied, gcl returns another R function that implements the found classifier. This computed classifier function takes one argument, a vector, matrix or data frame to be classified, and outputs a vector of class memberships for each input vector, matrix or data frame row (see the examples section below).
Even though the paper cited above is on classification using gene expression data, numerical data in general can be used. For instance
> library(gcl)
> library(datasets)
> data(iris)
> classifier <- gcl(iris, nlev=5)
> acc.eval(classifier, iris)
computes a fuzzy rule classifier for Edgar Anderson's Iris Data set and evaluates the classifier accuracy on the same data set.
The function gcl can also be given an optional argument cfun = function(attribute.values,outcomes,...) that, given a vector attribute.values and a vector outcomes, determines the inclusion cost that should be associated with the attribute that has the values found in attribute.values. An example could be
function(a,b) 1/abs(cor(a,b))
that associates less cost with an attribute that has a higher absolute value correlation with the outcome. Note that the values given to the function cfun are the values for the attribute after discretization.
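A slightly more defensive variant of the correlation-based cost function is sketched here. The name cor_cost is made up for illustration, and the sketch assumes that both vectors can be coerced with as.numeric (for a factor this yields the integer level codes); how gcl actually encodes the outcomes it passes to cfun is not stated here.

```r
# Hypothetical cost function: low cost for attributes strongly correlated
# with the outcome, infinite cost when the correlation is zero or undefined.
cor_cost <- function(attribute.values, outcomes, ...) {
  # cor() warns and returns NA when one vector is constant; treat that as
  # "no usable information" instead of propagating NA
  r <- suppressWarnings(cor(as.numeric(attribute.values), as.numeric(outcomes)))
  if (is.na(r) || r == 0) Inf else 1 / abs(r)
}

cor_cost(c(1, 2, 3, 4), c(1, 1, 2, 2))   # informative attribute, finite cost
cor_cost(c(1, 1, 1, 1), c(1, 1, 2, 2))   # constant attribute, cost Inf

# It would then be passed on as described above (requires the gcl package):
# classifier <- gcl(mydata, cfun = cor_cost)
```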
computed classifier
The computed classifier is a function that takes one argument, the numeric vector, matrix or data frame to be classified. When applied it outputs a vector of class memberships for each input vector, matrix or data frame row. The input data has to have (column) names compatible with the names of the data from which the classifier function was generated. Otherwise, the classifier function cannot operate.
The data supplied to the computed classifier function cannot contain non-numeric data. Specifically, if a classifier input data frame contains a non-numeric class labels column (typically a factor), this must be removed before application. Much like:
> classifier(inputdata[-ncol(inputdata)])
if the offending column is the last one.
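When the non-numeric column is not necessarily the last one, the numeric columns can be selected by type instead of by position. The helper numeric_only is a hypothetical convenience function, not something provided by gcl, and it assumes the input is a data frame.

```r
# Hypothetical helper: keep only the numeric columns of a data frame.
numeric_only <- function(inputdata) {
  stopifnot(is.data.frame(inputdata))
  inputdata[vapply(inputdata, is.numeric, logical(1))]
}

library(datasets)
data(iris)
names(numeric_only(iris))   # the factor column Species is dropped

# The result can then be handed to a computed classifier (requires gcl):
# classifier(numeric_only(iris))
```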
The computed classifier function can be “dumped” to file by using R's dump function. If classifier is the name of the computed function, then
> dump("classifier","classifier.r")
creates a file ‘classifier.r’ containing the R source code of the function classifier. This source code can then be distributed and will work as a stand-alone program.
If the computed classifier function is supplied with no argument, or with a NULL argument, it will return a documentation string. The content of this string is decided by the value of the gcl.decorate option at the time of the gcl call. If getOption("gcl.decorate") returns 1, the string contains the fuzzy rules in a human readable format; if it returns 2 (default), each rule is also followed by the three numbers determining the membership functions of each antecedent fuzzy proposition. If it returns NULL, no information about the rules is generated. This might be used to save space and loading time.
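To make the three numbers concrete, here is a minimal sketch of a triangular membership function defined by its first 0 crossing a, its 1 crossing b and its last 0 crossing c. The function tri_membership is illustrative only and is not taken from the gcl source. In particular, treating an infinite crossing (as in triples like [ 0 1 Inf ] or [ -Inf 0 1 ]) as a flat shoulder with membership 1 on that side is an assumption.

```r
# Illustrative triangular membership function (not the gcl implementation).
# a: first 0 crossing, b: 1 crossing, c: last 0 crossing, with a < b < c.
tri_membership <- function(x, a, b, c) {
  stopifnot(a < b, b < c)
  # rising edge from a to b; assumed flat (membership 1) if a is -Inf
  left <- if (is.infinite(a)) rep(1, length(x)) else (x - a) / (b - a)
  # falling edge from b to c; assumed flat (membership 1) if c is Inf
  right <- if (is.infinite(c)) rep(1, length(x)) else (c - x) / (c - b)
  # clip to [0, 1]: memberships outside the crossings are 0
  pmax(0, pmin(left, right, 1))
}

x <- c(-1, 0, 0.25, 0.5, 1, 2)
tri_membership(x, 0, 1, 2)      # ordinary triangle peaking at 1
tri_membership(x, 0, 1, Inf)    # right shoulder
tri_membership(x, -Inf, 0, 1)   # left shoulder
```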
The computed classifier function returned has four attributes that can be accessed by the attributes() function. They are summary.gcl.rnum, summary.gcl.amean, summary.gcl.natt and summary.gcl.nlev. If getOption("gcl.decorate") returns a positive number, they contain the number of rules in the classifier, the average number of attributes in the rule antecedents, the number of distinct attributes found in the rules, and the value of the nlev parameter passed to the gcl function. The classifier function object returned by tcl has similar attributes.
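Reading these attributes works the same way as for any other R object. The sketch below uses a stand-in function with made-up attribute values so that it runs without the gcl package; a real classifier returned by gcl would be inspected in the same manner.

```r
# Stand-in for a computed classifier; the attribute values are invented
# for illustration and do not come from an actual gcl run.
toy_classifier <- structure(
  function(x = NULL) NULL,
  summary.gcl.rnum  = 4,   # number of rules
  summary.gcl.amean = 2,   # average number of attributes per antecedent
  summary.gcl.natt  = 2,   # number of distinct attributes in the rules
  summary.gcl.nlev  = 2    # nlev value passed to the builder
)

attr(toy_classifier, "summary.gcl.rnum")   # a single attribute
names(attributes(toy_classifier))          # names of all attributes
```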
sgcl
The function sgcl partitions the input data mydata into two data sets, training and holdout. It then performs an n-fold (given by the parameter s.fold) cross validation over the training set, using the classifier builder cb (default gcl) to generate classifiers. This process results in classifiers c_i for i = 1,2,...,n with associated performance measures p_i. Each classifier c_i generated during the cross validation is applied to the holdout data set, resulting in an associated performance measure q_i. For each classifier c_i, the expression
(q_i + p_i)/2 * 1/(1 + |q_i - p_i|)
is evaluated, and the classifier that maximizes this expression is returned by sgcl. The rationale for this is that we want the classifier with the best consistent performance: the first factor is the mean of the two measures, and the second factor shrinks as they diverge. In addition to the arguments listed above, sgcl takes the arguments that cb and cv take. The default performance measure used by sgcl is accuracy as computed by acc.eval. Ties are broken arbitrarily.
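The selection expression can be evaluated by hand to see how it trades average performance against consistency. In this sketch the function name consistency_score and the accuracy values are hypothetical; only the formula itself is taken from the description of sgcl.

```r
# Selection score from the text: mean of p and q, damped by their gap.
consistency_score <- function(p, q) (q + p) / 2 * 1 / (1 + abs(q - p))

# Hypothetical cross validation (p) and holdout (q) accuracies for 4 folds
p <- c(0.95, 0.90, 0.85, 0.92)
q <- c(0.80, 0.89, 0.86, 0.70)

scores <- consistency_score(p, q)
round(scores, 4)
best <- which.max(scores)   # index of the classifier that would be kept
best
```

In this made-up example the first classifier has the highest cross validation accuracy but loses to the second one, whose two measures nearly agree. Note that which.max returns the first maximum, which is just one possible way to break ties.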
tcl
The experimental function tcl computes a classification tree classifier using a recursive partitioning algorithm similar to ID3.
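For orientation, the textbook definitions of information gain and gain ratio, the two kinds of splitting function named for the g argument, can be written in a few lines. The functions entropy, info_gain and gain_ratio below are illustrative re-implementations of the standard formulas for discrete vectors; they are not the gain and gainr functions shipped with the package, which may differ in detail.

```r
# Shannon entropy (in bits) of a discrete vector
entropy <- function(v) {
  p <- as.numeric(table(v)) / length(v)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of attribute a with respect to outcome y:
# entropy of y minus the expected entropy of y within each level of a
info_gain <- function(a, y) {
  stopifnot(length(a) == length(y))
  cond <- sum(sapply(split(y, a), function(s) length(s) / length(y) * entropy(s)))
  entropy(y) - cond
}

# Gain ratio: information gain normalised by the entropy of the attribute.
# Undefined (NaN) when the attribute is constant.
gain_ratio <- function(a, y) info_gain(a, y) / entropy(a)

y <- c(0, 0, 1, 1)
info_gain(c(0, 0, 1, 1), y)    # attribute that separates the classes
info_gain(c(0, 1, 0, 1), y)    # attribute that carries no information
gain_ratio(c(0, 0, 1, 1), y)
```

In the same terms, the inf.lim parameter corresponds to comparing a quantity like entropy(y) for the current partition class against a threshold before splitting further.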
The functions gcl, tcl, and sgcl return a function representing the computed classifier.
The computed classifier function returns a matrix with as many columns as the original data had class labels, NULL, or a text string representing a description of the classifier.
If the column names do not match between the original data and the
data to be classified by the computed function, the error
Error in x[[ind]] : subscript out of bounds
is likely.
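A quick check before classification can catch such a mismatch early. The helper missing_columns is hypothetical and assumes that the training data is still available and that its last column holds the class labels, as required for mydata.

```r
# Hypothetical pre-check: which input columns expected by the classifier
# (all training columns except the last) are absent from the new data?
missing_columns <- function(training, newdata) {
  expected <- colnames(training)[-ncol(training)]
  # a plain named vector has names(), a matrix or data frame has colnames()
  given <- if (is.null(dim(newdata))) names(newdata) else colnames(newdata)
  setdiff(expected, given)
}

library(datasets)
data(iris)
missing_columns(iris, iris[-ncol(iris)])   # nothing missing
missing_columns(iris, iris[, 2:4])         # reports the absent column
```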
Note that applying sgcl to small data sets is not advisable as the data is split repeatedly, making the learning and filtering sets even smaller.
Staal A. Vinterbo (C) 2007
staal@dsg.harvard.edu
Vinterbo, S.A.; Kim, E. and Ohno-Machado, L. Small, fuzzy and interpretable gene expression based classifiers. Bioinformatics, 2005, 21, 1964-1970. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/21/9/1964
## run the demo
demo(gcldemo)

## play with the iris data set:
## Not run:
library(datasets)
data(iris)
classifier <- gcl(iris, nlev=5)
acc.eval(classifier, iris)
## End(Not run)

## compare performance of gcl and tcl
## Not run:
library(datasets)
data(iris)
cv52(iris, gcl, tcl, acc.eval, nlev=5, t.nlev=5)
## End(Not run)

## or a little more complex
library(gcl)
count <- matrix(c(0,0,0,1,1,0,1,1),ncol=2,byrow=TRUE)
xordata <- cbind(count, apply(count, 1, function(x) xor(x[1],x[2])))
colnames(xordata) <- c("Bit.1", "Bit.2", "XOR")
cf <- gcl(xordata,2,c())
cat(cf())
## Not run:
# should produce something like:
Generated by gcl v1.06c Sat Nov 12 19:25:12 2005.
nlev=2, filtering: no filtering took place
rule generation: no subsampling.
(c) Copyright 2005, Staal Vinterbo, all rights reserved.
Bit.1=2 & Bit.2=2 => XOR=0 [ 0 1 Inf ],[ 0 1 Inf ]
Bit.1=2 & Bit.2=1 => XOR=1 [ 0 1 Inf ],[ -Inf 0 1 ]
Bit.1=1 & Bit.2=2 => XOR=1 [ -Inf 0 1 ],[ 0 1 Inf ]
Bit.1=1 & Bit.2=1 => XOR=0 [ -Inf 0 1 ],[ -Inf 0 1 ]
## End(Not run)

v <- c(0,1)
names(v) <- colnames(xordata)[1:2]
cf(v)
## Not run:
# produces:
     0 1
[1,] 0 1
dump("cf", "cf.r")
rm(cf)
source("cf.r")
cf(v)
# produces:
     0 1
[1,] 0 1
## End(Not run)