yourprep {YourCast}    R Documentation

Data object creation wizard for YourCast

Description

Builds the data object for the yourcast function from files in the working directory or another specified directory, and checks it for errors.

Usage

yourprep(dpath=getwd(), tag="csid", index.code="ggggaa",
                     datalist=NULL, G.names=NULL, A.names=NULL,
                     T.names=NULL, adjacency=NULL, year.var=FALSE,
                     sample.frame=NULL, summary=FALSE, verbose=FALSE) 

Arguments

dpath String. Name of the directory where data files are stored. If NULL, defaults to the working directory. Default: getwd()
tag String. Group of characters placed before CSID code in filenames to indicate which files in dpath function should load. The tag can also be used to differentiate between different groups to be considered in separate analysis; for example, ‘m’ for male deaths and ‘f’ for female deaths. Default: "csid"
index.code String indicating how the CSID index variable is coded in the input data. Between 0 and 4 of the following two characters are used in this order: g for the geographic index (such as country) and a for a grouped continuous variable like an age group. For example, ggggaa would have the function interpret ‘245045’ by using ‘2450’ as the country code and ‘45’ as the age group. Default: "ggggaa"
datalist A list of cross section dataframes already loaded into the workspace to be added to the dataobj. Names of list elements should be the numerical CSID code for each cross section, and dataframes should be formatted identically to files loaded from an external directory (see Details)
A.names, G.names, T.names String. Filename of optional two-column data files that list all valid numerical codes (in the first column) and corresponding alphanumeric names (optionally in the second column) for the indices corresponding to geographic areas in G.names, age groups in A.names, and time periods in T.names. The function will search dpath for a file with the specified name; please include column labels. The optional alphanumeric identifiers are most commonly used only for geographic areas, since numerical values for age groups and time periods are usually meaningful on their own. However, if another grouped continuous variable is used in place of ages, for example, specifying these labels will be important for output to be meaningful. NOTE: Auxiliary files will be loaded automatically by yourprep() if they are saved in the dpath and labeled with the tag specified by the user. See the ‘Details’ section for more information. Default: NULL
adjacency Data file with codes to construct the symmetric matrix (geographic region by geographic region) of proximity scores for geographic smoothing used by the ‘map’ and ‘bayes’ methods. The larger the relative score, the more proximate that pair of countries is in the prior; a zero element means the two geographic areas are unrelated (the diagonal is ignored). Each row of the adjacency file has three columns, consisting of geographic codes for two countries and a score indicating the proximity or similarity of the two geographic regions; please include column labels. For convenience, geographic regions that are unrelated (and would have zero entries in the symmetric matrix) may be omitted from the file. In addition, the file may include rows corresponding to geographic regions not included in the present analysis. Default: NULL
year.var Boolean. Should be TRUE if the year is coded as a separate variable rather than as the rowname in the cross section data files. The function will look for a year variable to use as rownames and then drop it from the dataframe. The change is made to a dataframe only if it does not already have rownames or if its existing rownames are merely a ‘1...N’ index of row numbers, so the correction can be applied even if some cross sections have no year variable and already have the correct rownames. Default: FALSE
sample.frame Optional four element vector containing, in order, the start and end time periods to be used for the observed data and the start and end time periods to be forecast. Cross sections need not all begin at the starting date, but each must contain all years after its first observed value. Variables to be forecast should be coded as NA in the out-of-sample period. Note that this makes it easy to reserve a range of values of the dependent variable for out-of-sample forecasting evaluation; our summary and plot functions in yourcast will make these comparisons automatically if the out-of-sample data are included. yourprep() uses this information only to verify that cross sections are correctly constructed. Default: NULL
summary Boolean. If TRUE, means for available observations on each variable are displayed for the cross sections read by yourprep(). Default: FALSE
verbose Boolean. If TRUE, function prints name of each cross section or auxiliary file as it is read into the dataobj. Default: FALSE
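The index.code convention described above can be illustrated with a short sketch. The helper split_csid() is hypothetical and not part of YourCast; it only assumes that each character of index.code labels the digit in the same position of the CSID code.

```r
# Hypothetical helper (not part of YourCast): split a CSID code into its
# geographic and age parts according to an index.code string such as "ggggaa".
split_csid <- function(csid, index.code = "ggggaa") {
  labels <- strsplit(index.code, "")[[1]]
  digits <- strsplit(csid, "")[[1]]
  # The code and the index string must have the same length
  stopifnot(length(labels) == length(digits))
  c(geo = paste(digits[labels == "g"], collapse = ""),
    age = paste(digits[labels == "a"], collapse = ""))
}

parts <- split_csid("245045")
print(parts)  # geo is "2450", age is "45"
```

With index.code="ggaa" the same helper would read ‘2445’ as geographic code ‘24’ and age group ‘45’.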

Details

Creates the dataobj input for yourcast from files in the working directory or another specified directory. Checks that all cross sections in the data list are titled properly and, if the sample.frame argument is specified, that all years up to the last predicted year are included in the dataframes. Please note, however, that all cross sections from the same geographic area must have the same observation and prediction years in the dataframe (even if NA) for the graphing software plot.yourcast to work.

The cross section files must be named according to the CSID identifiers for country code and age group, preceded by the specified tag (default: "csid") so that yourprep() can distinguish them from other files in the dpath. For example, for the USA (country code 2450) time series of 45 year old individuals, the file name should be ‘csid245045.txt’ if the tag is left as the default. Files must have an extension so that the program can recognize how the data are coded. Currently, fixed width text files (‘*.txt’), comma-separated values (‘*.csv’), and Stata v.5-10 (‘*.dta’) files are supported, and multiple file types may be used in the same run of the program. ‘*.Rdata’ objects can be included with the datalist option after they are loaded to a list in the workspace. yourprep() includes diagnostics to ensure that objects are properly named and not included accidentally, but users should examine the specified dpath before running yourprep() to minimize errors.
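A minimal sketch of the naming rule follows. It assumes the default six-character index.code ("ggggaa"); the regular expression and the candidate file names are illustrative only and are not the matching logic yourprep() actually uses.

```r
tag <- "csid"

# Expected file name for country code 2450, age group 45, stored as text
fname <- paste0(tag, "2450", "45", ".txt")

# Illustrative pattern: tag, six digits, then one of the supported extensions
pattern <- paste0("^", tag, "[0-9]{6}\\.(txt|csv|dta)$")

# Made-up directory listing used to show which names would qualify
candidates <- c("csid245045.txt", "csid204550.csv", "notes.txt",
                "csid.G.names.txt", "m245045.dta")
matched <- candidates[grepl(pattern, candidates)]
print(matched)
```

A check of this kind on list.files(dpath) before calling yourprep() is one way to spot stray or misnamed files.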

Each cross section file should contain labeled columns of time-series data for the dependent variable(s) (e.g., disease, pop) and the covariates that will be used in the forecast. The rownames for the dataframe should be the observation year (if the year is coded as a separate variable, set year.var=TRUE). The files must contain the full time series that will be specified in the sample.frame argument in yourcast after the first observed year. For instance, if sample.frame=c(1950,2000,2001,2030), then files would have observations that start between 1950 and 2000 and include all other years (even if the entries are NA) up to the last year of prediction, i.e., 2030.
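The year-coverage requirement can be checked by hand along these lines. This is a sketch only: check_years() is a hypothetical helper, the toy cross section is invented, and the first row is treated as the first observed year for simplicity; the diagnostics inside yourprep() may differ.

```r
sample.frame <- c(1950, 2000, 2001, 2030)

# Toy cross section: years as rownames, observed from 1960, NA when forecast
years <- 1960:2030
cs <- data.frame(disease = c(seq(1, 41) / 10, rep(NA, 30)),
                 pop = seq_along(years),
                 row.names = as.character(years))

# Hypothetical check: the series starts inside the observed window and has
# a row for every year from its first year to the last forecast year
check_years <- function(cs, sample.frame) {
  obs <- as.integer(rownames(cs))
  first <- min(obs)
  needed <- first:sample.frame[4]
  first >= sample.frame[1] && first <= sample.frame[2] && all(needed %in% obs)
}

ok <- check_years(cs, sample.frame)
gap <- check_years(cs[rownames(cs) != "1975", ], sample.frame)
print(c(complete = ok, with_gap = gap))
```

Dropping a single year (1975 here) is enough to make the check fail, which is the situation the sample.frame diagnostics are meant to catch.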

Optional auxiliary files such as G.names should be named according to the filename specified in the respective arguments. If specified, these files must have extensions and be coded in one of the three supported file types. However, these files will be automatically loaded by yourprep() if they are saved in the dpath and labeled with the tag specified by the user. The default names for these files must be used (e.g., ‘G.names’ and ‘adjacency’). For example, if the tag is left as the default and there is a file in the dpath labeled ‘csid.G.names.txt’, yourprep() will load this automatically and save the input as the G.names element of the ‘dataobj’ list. yourprep() arguments such as G.names take precedence over ‘TAG.*’ files in the dpath.
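The precedence rule for auxiliary files can be sketched as follows. resolve_gnames() is a hypothetical helper written for illustration, not a YourCast function, and the empty files are created in a temporary directory only so that the lookup has something to find.

```r
# Hypothetical helper: decide which file supplies G.names.
# An explicit argument wins; otherwise look for a tagged default file.
resolve_gnames <- function(arg, dpath, tag = "csid") {
  if (!is.null(arg)) return(file.path(dpath, arg))
  pattern <- paste0("^", tag, "\\.G\\.names\\.(txt|csv|dta)$")
  hits <- list.files(dpath, pattern = pattern, full.names = TRUE)
  if (length(hits) > 0) hits[1] else NULL
}

# Temporary directory holding one tagged file and one user-named file
dpath <- file.path(tempdir(), "yourprep_demo")
dir.create(dpath, showWarnings = FALSE)
invisible(file.create(file.path(dpath, "csid.G.names.txt")))
invisible(file.create(file.path(dpath, "cntry.codes.txt")))

auto <- resolve_gnames(NULL, dpath)
explicit <- resolve_gnames("cntry.codes.txt", dpath)
print(basename(c(auto, explicit)))
```

With no argument the tagged file ‘csid.G.names.txt’ is picked up; when a filename is supplied it is used instead.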

Value

dataobj A list with several components:

data
A list with the cross-sectional data matrices as elements.

proximity
A symmetric matrix (geographic region by geographic region) of proximity scores for geographic smoothing used by the ‘map’ and ‘bayes’ methods. The larger each element of the matrix, the more proximate that pair of countries is in the prior; a zero element means the two geographic areas are unrelated (the diagonal is ignored). Each element of the symmetric matrix is created from one row of the adjacency input to yourprep() (which is two country codes and a proximity score).

G.names, A.names, T.names
Optional two-column dataframes that list all valid numerical codes (in the first column, labeled codes) and corresponding alphanumeric names (optionally in the second column, labeled name) for the indices corresponding to the geographic areas in G.names, age groups in A.names, and time periods in T.names.

index.code
A string indicating how the index variable is coded in the input data.
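To make the relationship between the three-column adjacency input and the proximity component concrete, here is a minimal sketch of the construction. The column names (g1, g2, score), the scores, and all codes other than 2450 are made up for illustration; this is not the code yourprep() runs internally.

```r
# Invented adjacency rows: two geographic codes and a proximity score
adj <- data.frame(g1 = c(2450, 2450, 4080),
                  g2 = c(4080, 2090, 2090),
                  score = c(2, 1, 0.5))

# All regions mentioned in the file, used as row and column labels
codes <- as.character(sort(unique(c(adj$g1, adj$g2))))
prox <- matrix(0, nrow = length(codes), ncol = length(codes),
               dimnames = list(codes, codes))

# Each input row fills two mirrored cells; omitted pairs stay at zero
for (i in seq_len(nrow(adj))) {
  a <- as.character(adj$g1[i])
  b <- as.character(adj$g2[i])
  prox[a, b] <- adj$score[i]
  prox[b, a] <- adj$score[i]
}
print(prox)
```

Pairs that do not appear in the input keep a zero entry, which matches the convention that unrelated regions may be omitted from the file.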

Author(s)

Jon Bischof jbischof@fas.harvard.edu

References

http://gking.harvard.edu/yourcast

See Also

yourcast function and documentation (help(yourcast))

Examples

# Working directory automatically set to directory with cross
# section and auxiliary files to begin. Files for this example
# in 'data' folder of YourCast library.

#Old working directory to be restored later
oldwd <- getwd()
# Now setting wd to 'data' folder in YourCast library
setwd(paste(.libPaths()[1],"/YourCast/data",sep=""))

# Simple run of the function, using option that turns year variable
# into label in each cs. Use sample.frame argument for all diagnostics
# to work
 
dta <- yourprep(G.names="cntry.codes.txt",adjacency="adjacency.txt",
year.var=TRUE,verbose=TRUE,sample.frame=c(1950,2000,2001,2030))

# With summary output (means of variables in each cross section) 

## Not run: 
dta <- yourprep(G.names="cntry.codes.txt",adjacency="adjacency.txt",
year.var=TRUE,summary=TRUE)
## End(Not run)

# Function can also add datafiles already loaded into R as objects in
# the workspace with "datalist" option if put into a list and properly
# labeled. All diagnostics still performed 
# 'csid204545', etc., are dataframes in workspace

# Labels changed to nonsense ones so as not to confuse with other files

data(csid204545)
data(csid204550)
data(csid204555)

datalist <- list("123456"=csid204545,"234567"=csid204550,
"345678"=csid204555) 

# Verbose option turned on and datalist argument added 

dta <- yourprep(G.names="cntry.codes.txt",adjacency="adjacency.txt",
year.var=TRUE,verbose=TRUE,datalist=datalist)

# Setting working directory back
setwd(oldwd)
rm(oldwd)

[Package YourCast version 0.9-7 Index]