g1_part {mseq} | R Documentation |
The file g1.csv in the folder data_top100 stores the counts and sequences of the top 100 genes in the Grimmond EB data. This dataset stores only a part of g1.csv: the top 10 genes.
Only a small part is kept in order to shorten the running time of the example code.
The full top 100 genes from each dataset are provided as separate files in the folder data_top100. Please read Readme_format.txt in that folder for details about the data and the required format.
This data can be generated by
g1 <- read.csv("g1.csv");
g1_part <- g1[g1$index < 11,]
data(g1_part)
A data frame with 8307 observations on the following 4 variables.
index
tag
seq (with values T, A, C, G)
count
index is an index identifying the gene from which this count comes.
tag is an integer value. 0 means this count is used; any other value means this count is not taken into account. In our files, -2 marks the UTR part and -1 marks the further 100 bp. The user can use any integer other than 0 to denote discarded counts.
seq is the nucleotide at this position. It must be an uppercase A, C, G, or T. No other characters, lowercase letters, or missing values are accepted. If the number of missing values is small, you can use T (or A, C, or G) for them; this should not change the result significantly.
count is the number of reads starting at this position.
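The restrictions on seq can be handled before the data are passed to the package. The following is a minimal sketch, not part of mseq: the helper name clean_seq and the threshold max_missing_frac are hypothetical, and the toy data frame is made up for illustration. It converts the nucleotides to uppercase and substitutes T for a small number of missing or unrecognized characters, stopping if too many positions would be replaced.

```r
# Hypothetical helper (not part of mseq): normalize the seq column.
clean_seq <- function(d, max_missing_frac = 0.01) {
  s <- toupper(as.character(d$seq))      # lowercase letters are not accepted
  bad <- is.na(s) | !(s %in% c("A", "C", "G", "T"))
  if (mean(bad) > max_missing_frac) {
    stop("too many missing or invalid nucleotides: ", sum(bad))
  }
  s[bad] <- "T"                          # only a few gaps: substitute T
  d$seq <- s
  d
}

# Toy data (made up): 200 positions, two of which are unusable (NA and "N").
toy <- data.frame(index = 1L, tag = 0L,
                  seq = c("a", "C", NA, "g", "T", "N", rep("A", 194)),
                  count = 0L, stringsAsFactors = FALSE)
cleaned <- clean_seq(toy)
```

The threshold is a judgment call; the text above only says the number of substituted positions should be small.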
For each gene (or each group of positions that share the same expression level, such as an exon or isoform), a distinct index should be used. Each gene (or group) may include positions on both strands (as in data generated by Illumina) or on a single strand (as in data generated by ABi).
Within each gene (or group), the positions should be in 5 prime to 3 prime order for each strand. There should be no gaps or missing values.
So for each gene in Illumina output, the data consist of two halves: the first half is the data from the forward strand, and the second half is the data from the reverse strand.
For each gene in ABi output, there are no such halves.
For each gene or each half, the nucleotides retained for analysis should be surrounded by enough non-retained nucleotides.
For example, if you want to use the 40 bp to the left and the 40 bp to the right as surrounding sequence, then there should be at least 40 bp of non-retained nucleotides on each side of the retained ones.
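The flanking requirement can be checked per gene with run-length encoding of the tag column. This is a rough sketch under stated assumptions, not part of mseq: the helper name check_flanks is hypothetical, the toy data are made up, and retained positions are taken to be those with tag equal to 0. For two-strand data the discarded run between the two halves is shared by both, so passing this check is necessary but not sufficient.

```r
# Hypothetical helper (not part of mseq): for every gene, check that each run
# of retained positions (tag == 0) has at least `flank` discarded positions
# immediately before and after it.
check_flanks <- function(d, flank = 40) {
  vapply(split(d$tag, d$index), function(tag) {
    runs <- rle(tag == 0)
    n <- length(runs$values)
    kept <- which(runs$values)
    if (length(kept) == 0) return(TRUE)  # nothing retained, nothing to check
    # Length of the discarded run on each side (0 if the retained run
    # touches the start or end of the gene).
    before <- ifelse(kept > 1, runs$lengths[pmax(kept - 1, 1)], 0)
    after  <- ifelse(kept < n, runs$lengths[pmin(kept + 1, n)], 0)
    all(before >= flank & after >= flank)
  }, logical(1))
}

# Toy data (made up): gene 1 has 40 bp flanks, gene 2 has only 10 bp on the left.
toy <- data.frame(
  index = rep(1:2, c(180, 150)),
  tag = c(rep(-1, 40), rep(0, 100), rep(-2, 40),
          rep(-1, 10), rep(0, 100), rep(-1, 40))
)
flank_ok <- check_flanks(toy, flank = 40)
```

The result is a logical vector named by index, so genes that fail can be listed with names(flank_ok)[!flank_ok].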
The correct format is very important; otherwise, the program may give unpredictable results.
The package itself does not verify that the format is correct. Please make sure you have checked it.
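Since the format is not verified by the package, a small check run beforehand may catch common problems. The following is only a sketch of such a check: the helper name check_format is hypothetical and not part of mseq, the toy data frames are made up, and the contiguity test reflects an assumption (that all rows of one gene are stored together) rather than a documented rule.

```r
# Hypothetical helper (not part of mseq): collect format problems as strings.
check_format <- function(d) {
  required <- c("index", "tag", "seq", "count")
  missing_cols <- setdiff(required, names(d))
  if (length(missing_cols) > 0) {
    return(paste("missing column:", missing_cols))
  }
  problems <- character(0)
  if (any(is.na(d[required])))
    problems <- c(problems, "missing values present")
  if (!all(as.character(d$seq) %in% c("A", "C", "G", "T")))
    problems <- c(problems, "seq contains characters other than A, C, G, T")
  if (!is.numeric(d$tag) || any(d$tag != round(d$tag), na.rm = TRUE))
    problems <- c(problems, "tag is not integer-valued")
  if (!is.numeric(d$count) || any(d$count < 0, na.rm = TRUE))
    problems <- c(problems, "count is not a non-negative number")
  # Assumption: rows of one gene are contiguous, so each index forms one run.
  if (anyDuplicated(rle(as.vector(d$index))$values) > 0)
    problems <- c(problems, "rows of the same index are not contiguous")
  problems
}

# Toy data (made up): one well-formed frame and one with a lowercase nucleotide.
good <- data.frame(index = c(1, 1, 2, 2), tag = c(-1, 0, 0, -2),
                   seq = c("A", "C", "G", "T"), count = c(0, 3, 1, 0),
                   stringsAsFactors = FALSE)
bad <- good
bad$seq[2] <- "n"
problems_good <- check_format(good)
problems_bad <- check_format(bad)
```

An empty result means none of these particular checks failed; it does not guarantee that the data satisfy every requirement in Readme_format.txt. The same function could be applied to g1_part after data(g1_part).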
Li J, Jiang H, Wong WH, Modeling non-uniformity in short-read rates in RNA-Seq data, submitted.