est.rr {RelativeRisk}R Documentation

Estimates Relative Risk

Description

Estimates relative risk using GLM and Logit-Log translations

Usage

        est.rr(frml,data,indexed=FALSE,theta,doLog=TRUE,doTally=FALSE,sigDigits=2,cProb=c(0.025,0.975))

Arguments

frml a standard formula like y~X1+X2, or a special one like y1|y2~X1+X2, where y1 and y2 are both columns in data
data A data frame.
indexed If TRUE, y2 in the formula is an index with values (0,1), indexing the counts in y1. If FALSE, y2 in the formula is the "failure" column corresponding to the "success" column in y1.
theta For case-control data. The ratio of population probabilites. A two column (case,control) response is required.
doLog If FALSE, GLM estimation will be suppressed, and the relative risks will be estimated by the Logit-Log translation only.
doTally If TRUE, tally.data() will be used to tally the data into strata.
sigDigits Comparison digits used by tally.data().
cProb Confidence probabilites.

Details

This function estimates relative risks from either prospective or retrospective data. It should never fail to produce an estimate. It may however fail to return confidence intervals.

A vignette giving further details is availble. To access it, type

vignette("RelativeRisk")

If doLog==TRUE, it first attempts to estimate relative risks using glm() with a log link. If glm() fails, it tries glm() with modified data (see note). If this fails, it estimates relative risk with a logit link, uses Logit-Log translation (see the to.rr() documentation) to translate the resulting odds ratios to relative risks, and attempts to produce confidence limits with the jackknife.

Several types of response are recognized:

(1) The response may be a two column matrix defined in the calling environment, with columns (success,failure), or (case,control), containing counts.

(2) The response may be a single variable, defined in the calling environment, or the name of a variable in data. The single variable should contain values (0,1), where 1 denotes success (if doTally=TRUE, such a variable will be tallied into a two column matrix with success and failure columns).

(3) The response may be the names of two variables in data containing counts of both success and failure, with the two names separated by a "|", like y1|y2.

(4) The response may be the names of two variables in data with the names separated by a "|", like y1|y2. The first variable y1 should contain counts of both success and failure, and the second y2 index values (0,1) to distinguish the two types: 1 denotes success. indexed should be set to TRUE in this case , and doTally will be set to TRUE by the function.

A dot may be used after the "~" to indicate that all variables in data are to be included linearally in the model

The model is log(p)=a+b'x, where x is a vector of environmental variables. Relative risks are exp(b) when the variables are coded (0,1). When they are coded (l,u), the relative risk is exp((u-l)b). coef(glm) for glm an element of the output list, gives b values. est.rr() returns exp((u-l)b). The function as.twolevel() may be used to code data into two (0,1) levels. The meaning of relative risks for variables with multiple levels is not always clear; however, factors may be used. est.rr() uses contr.treatment when factors are present. This is a (0,1) coding. est.rr() uses a form that contrasts the various levels with the last level; thus a three level factor for variable X, will have coefficients for X1, and X2, representing the relative risk of the first and second levels with respect to the third level; i.e., the risk for the last level is in the denominator.

The formula may be used to create product and other terms. The meaning of relative risk for such terms is not obvious, and they often create problems for the glm algorithm. .

Most data will support only a few variables: over parameterization can cause convergence problems. In addition, GLM is no different from multivariate regression with respect to substantial correlation among the independent variables. If anything, it is more sensitive. Correlation among the variables introduces redundancy into the model, and the resulting estimates become partially confounded, which is why the coefficients sometimes change drastically when a variable is added or deleted.

Retrospective case-control data may be analyzed by setting theta to the ratio of population probabilities. It is assumed that there are two populations: one from which the cases are drawn and the other from which the controls are drawn. The marginal probabilities for these are P and (1-P), and theta=(1-P)/P. doTally should be set to TRUE in order to produce a response with two columns (case, control). The estimates are fairly robust to misspecification of theta. The only time a modest theta will have much effect, is when the cases greatly outnumber the controls.

Value

A list with elements ( log, glm, table). The glm element will be present if a GLM fit with log link was successful and doLog=TRUE, otherwise glmLogit will appear. glm contains the output of glm(). table gives the relative risk estimates and their confidence intervals plus the odds ratios. The relative risk values in table may differ from those obtainable from the coefficients in glmLogit, since the jackknife produces a slightly different estimate. Confidence intervals will be shown in table if either the glm fit with log link was successful or the jackknife worked. The heading of the relative risk column will be labeled LL-rr if the Logit-Log translation was used.
log will contain comments.

Note

GLM with a log link may fail to converge. One reason for this is the fact that the weight function used in solving the likelihood equations can become infinite, Lumley et.al. (2006). The following procedure, with the flavor of a continuity correction, is used to avoid this problem: Given a vector y of (0,1) response values, the vector is replaced by a [success,failure] two column matrix with values [(yK+(1-y)),y+K(1-y)], where K is a large number, say 1000. When y is a two column matrix it is modified to [yK+(y==0)]. This divides the variances of the estimates by K, without changing the scale of the estimates. It also multiplies the log-likelihood by K. As K increases the estimates converge to the correct values. For finite values, the estimates are slightly biased. K=1000 is used in est.rr.

Author(s)

Bob Wheeler bwheelerg@gmail.com

Please cite this program as follows:

Wheeler, R.E. (2009). est.rr RelativeRisk. The R project for statistical computing http://www.r-project.org/

References

Lumley, T.,Kronmal, R. and Shuangge, M. (2006). Relative risk regression in medical research: models, contrasts, estimators, and algorithms. UW Biostatistics Working Paper Series Paper 293. University of Washington. http://www.bepress.com/uwbiostat/paper293

Examples


        data(gradData)
        aa<-est.rr(Count|Admitted~.,data=gradData,indexed=TRUE)
        names(aa)
        aa$log
        aa$table

        aa<-est.rr(Count | Admitted ~.,data=gradData,indexed=TRUE,doLog=FALSE)
        aa$table

        # Simpson's paradox, since the observed relative risks for Males in each department
        # tend to be greater than unity, while the overal all risk is less than unity.
        aa<-est.rr(Count|Admitted~Dept*Male,indexed=TRUE,data=gradData)
        aa$table

        # Convergence difficulties example:
        data(TitanicMat)
        # Reverse the sex variable
        TitanicMatR<-as.twolevel("sex",TitanicMat,1)
        # glm() fails to converge upon reversing the variable.
        # Dropping the Age variable allows convergence in this particular case.
        # as including information in the log.
        aa<-est.rr(count|survived~.,indexed=TRUE,data=TitanicMat)
        at<-est.rr(count|survived~.,indexed=TRUE,data=TitanicMatR)
        aa$table
        at$table

        # Interpreting factors example:
        # The relative risks are for cancer, and the factors are all compared 
        # to their most extreme level: tobacco for example is compared to the heaviest users.
        # This is case-control data, and since cancer is rare, theta should be large, which
        # tends to make the relative risks the same as the odds ratios.
        aa<-est.rr(ncases|ncontrols~.,esoph,theta=1000)
        aa$table

        # Retrospective sampling example with 600 Cases and 300 Controls. 
        # The true relative risks are (0.3,1.5,2.0,0.7,0.3)
        data(simData)
        aa<-est.rr(Success|Failure~.,data=simData,theta=2.2)
        aa$table

        # Very large odds raio:
        data(Melanoma)
        aa<-est.rr(NarrowExc|WideExc~.,Melanoma)
        aa$table


[Package RelativeRisk version 1.1-1 Index]