query {seqinr}    R Documentation

To get a list of sequence names from an ACNUC data base located on the web

Description

This is a major command of the package. It executes all sequence retrievals using any selection criteria the data base allows. The sequences come from an ACNUC data base located on the web and are transferred through a socket. The command produces the list of all sequence names that fit the required criteria. The sequence names belong to the sequence class SeqAcnucWeb.

Usage

query(socket,listname,query, invisible = FALSE)

Arguments

socket     a socket of class connection, as returned by choosebank.
listname   the name of the list, as a quoted character string.
query      a quoted character string containing the request, written with the syntax given in the Details section.
invisible  if TRUE, the result of the query is returned invisibly but is still assigned in the environment.

Details

Each selection criterion is written using the following syntax:

c = criterion value
where c indicates which criterion is used. Many selection criteria are available. They correspond mainly to the structured elements of the sequence documentation in the data banks, and are detailed below. Criteria can be combined using 3 logical operations:

criterion1 ET criterion2 : logical AND (sequences that fit criteria 1 and 2 simultaneously).

criterion1 OU criterion2 : logical OR (sequences that fit at least one of the two criteria).

NO criterion1 : logical negation (sequences that do not fit criterion 1).

Parentheses can be used to delimit the range of operations. Lists of sequences can be re-used at will, which is very convenient for breaking complex requests into simple ones. For instance, here are two equivalent ways to get all coding sequences from Escherichia coli that are not partial:

# First way: a single request
s <- choosebank("genbank")
query(s$socket, "final", "sp=escherichia coli ET t=cds ET NO k=partial")

# Second way: step by step, re-using the intermediate lists
s <- choosebank("genbank")
query(s$socket, "eco", "sp=escherichia coli")
query(s$socket, "ecocds", "eco ET t=cds")
query(s$socket, "final", "ecocds ET NO k=partial")
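
Since requests are plain character strings, they can also be assembled programmatically. The following is only a minimal sketch: the helpers crit, q_and, q_or, q_not and grp are hypothetical names introduced here for illustration and are not part of seqinr. They do nothing more than paste strings together with the ET, OU and NO operators described above. The second species name is likewise just an illustrative value.

```r
# Hypothetical helpers, not part of seqinr: they only build request strings.
crit  <- function(code, value) paste0(code, "=", value)   # one criterion, e.g. "t=cds"
q_and <- function(...) paste(c(...), collapse = " ET ")   # logical AND
q_or  <- function(...) paste(c(...), collapse = " OU ")   # logical OR
q_not <- function(x) paste("NO", x)                       # logical negation
grp   <- function(x) paste0("(", x, ")")                  # explicit parentheses

# Same request as in the single-request example above.
req1 <- q_and(crit("sp", "escherichia coli"),
              crit("t", "cds"),
              q_not(crit("k", "partial")))

# A grouped request: coding sequences from either of two species.
req2 <- q_and(grp(q_or(crit("sp", "escherichia coli"),
                       crit("sp", "bacillus subtilis"))),
              crit("t", "cds"))

print(req1)
print(req2)

# The resulting string would then be passed as the 'query' argument, e.g.:
# query(s$socket, "final", req1)
```

Keeping the parentheses in a separate helper makes the grouping explicit in the calling code rather than relying on operator precedence.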

SP = species name
sequences from a given species or group of species. The special character @ can be used to match any group of characters in the species name, e.g. SP=RATTUS@. Spaces are allowed. Examples: ESCHERICHIA COLI, @COLI, E@COLI. Species names are tree-structured according to the biological classification of species.
K = keyword
sequences having a given keyword. Since keywords are tree-structured, as species are, you will also select all sequences associated with keywords further down in the tree. (@ can be used to match any group of characters.)
R = reference code
sequences from a given reference. References are specified as follows depending on the type of document:
Document                    Format                         Example
Journal article             journal_code/volume/1st_page   jme/34/17
Book                        book/year/1st_author           book/1980/broker
Thesis                      thesis/year/1st_author         thesis/1984/wildgruber
Patent                      patent/patent_coded_number     patent/ep0238993
Unpublished, or submitted   unpubl/year/1st_author         unpubl/1993/cho
J = journal name
sequences published in a given journal.
Y = year
sequences published in given year (e.g. 1982).
Y > year
sequences published after or during a given year.
Y < year
sequences published before or during a given year.
AU = author
sequences published by given author(s). Use @ to specify any letters in name (e.g. @ORMOND@ for Van Ormondt). Only last names are indexed - initials are ignored. All authors of journal articles are indexed. Only the first author of books, theses, patents and other documents is indexed.
T = sequence type
sequences of given type. You generally obtain subsequences with this criterion because types are for example tRNA, rRNA or protein gene. Type should not be confused with molecule which denotes the chemical nature of the sequenced molecule (e.g., DNA, mRNA, tRNA). Type is defined only for the nucleotide sequence banks. Presently the existing types are:
ID               Locus entry (EMBL, SWISS-PROT, NRSub)
LOCUS            Locus entry (GenBank, Hovergen, EMGLib)
CDS        .PE   protein coding region (all)
RRNA       .RR   mature ribosomal RNA (all)
TRNA       .TR   mature transfer RNA (all)
MISC_RNA   .RN   other structural RNA coding region (EMBL, GenBank, Hovergen, NRSub, EMGLib)
SNRNA      .SN   small nuclear RNA (EMBL, GenBank, Hovergen, EMGLib)
SCRNA      .SC   small cytoplasmic RNA (EMBL, GenBank, Hovergen, NRSub, EMGLib)
3'INT      .3I   3' intron (Hovergen)
3'NCR      .3F   3' non-coding region (Hovergen)
5'INT      .5I   5' intron (Hovergen)
5'NCR      .5F   5' non-coding region (Hovergen)
CPG        .CG   CpGobs/CpGexp > 0.5 (Hovergen)
INT_INT    .IN   internal intron (Hovergen)

Each entry of a FEATURE TABLE describing a coding region of a DNA fragment gives rise to a subsequence equal to the fragments described in the location of the feature. The type of the resulting subsequence equals the key of the corresponding feature table entry. The name of the resulting subsequence is built by adding to the parent sequence's name an extension uniquely identifying this particular feature.

Sequences of a given type are generally subsequences, i.e., fragments of parent sequences, except if the coding region covers totally the parent sequence, in which case ACNUC does not create a subsequence.

O = organelle
sequences from a given organelle. Organelle (e.g., chloroplast, mitochondrion) denotes the nature of the genome that harbors a particular gene. By extension, ACNUC also sees the nucleus as an organelle. Also, a nuclear-encoded gene coding for a protein exported to an organelle is considered as a nuclear gene. The existing organelles are:
CHLOROPLAST     Chloroplast genome (EMBL, GenBank, NBRF, Hovergen)
MITOCHONDRION   Mitochondrial genome (EMBL, GenBank, NBRF, Hovergen)
KINETOPLAST     Kinetoplast genome (EMBL, GenBank, Hovergen)
NUCLEAR         Nuclear genome (all)
M = molecule name
sequences with given chemical structure. In ACNUC, molecule denotes the chemical nature of the sequenced molecule (e.g., DNA, mRNA, tRNA). Molecule should not be confused with type which identifies the encoded molecule (e.g., protein, tRNA, rRNA). Thus the sequence of a tRNA gene has DNA for molecule because DNA rather than tRNA was sequenced. The subsequence covering the tRNA region has tRNA for type because this is the nature of the encoded product. Molecule is defined only for the nucleotide sequence banks (GenBank, EMBL, Hovergen, NRSub, and CGDB). Presently the existing molecules are:

DNA    Sequenced molecule is DNA (all)
RNA    Sequenced molecule is RNA (all)
MRNA   Sequenced molecule is mRNA (GenBank, Hovergen)
RRNA   Sequenced molecule is rRNA (GenBank, Hovergen)
TRNA   Sequenced molecule is tRNA (GenBank, Hovergen)
URNA   Sequenced molecule is snRNA (GenBank, Hovergen)
N = sequence name
sequence of given name.
AC = accession number
sequences of given accession number.
F = file name
sequences whose names are in a specified file.
FA = file name
sequences whose accession numbers are in a specified file.
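
The effect of the @ wildcard used by the SP, K and AU criteria can be previewed locally before sending a request. The sketch below is an assumption-laden illustration, not the server's matching code: match_at is a hypothetical helper that translates @ into a regular expression with utils::glob2rx and compares case-insensitively, and the species vector is a small hand-written sample rather than data from any bank. The real ACNUC matching rules may differ in details.

```r
# Hypothetical helper, not part of seqinr: emulate the '@' wildcard locally.
# Assumption: '@' behaves like the shell wildcard '*' and matching ignores case.
match_at <- function(pattern, names) {
  glob <- chartr("@", "*", pattern)   # '@' -> '*'
  rx   <- utils::glob2rx(glob)        # anchored regular expression
  grepl(rx, names, ignore.case = TRUE)
}

# Hand-written sample of species names, for illustration only.
species <- c("ESCHERICHIA COLI", "RATTUS NORVEGICUS",
             "RATTUS RATTUS", "ENTAMOEBA COLI")

print(species[match_at("RATTUS@", species)])
print(species[match_at("@COLI", species)])
print(species[match_at("E@COLI", species)])
```

Note that a pattern such as E@COLI is deliberately broad: in this sample it matches every name that starts with E and ends with COLI, not only Escherichia coli.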

Value

A list with the following components:

bank the name of the bank that has been chosen by choosebank.socket
call original call
name list name
req a list of sequence names that fit the required criteria
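
The components listed above can be accessed like those of any R list. The sketch below uses a mock object built by hand so that it runs without a network connection: the sequence names are invented placeholders, and in a real result the elements of req are of class SeqAcnucWeb rather than plain character strings.

```r
# Mock result, for illustration only: all values are invented placeholders.
# A real result comes from query() and its 'req' elements are SeqAcnucWeb objects.
res <- list(
  bank = "genbank",
  call = quote(query(s$socket, "ecoli", "sp=escherichia coli@")),
  name = "ecoli",
  req  = list("MOCKSEQ1", "MOCKSEQ2", "MOCKSEQ3")
)

n_found   <- length(res$req)   # number of sequence names in the list
first_two <- res$req[1:2]      # sub-list with the first two names
third     <- res$req[[3]]      # a single element, extracted with [[ ]]

print(n_found)
print(res$name)
print(third)
```

Single brackets return a sub-list, whereas double brackets extract one element; this is the same distinction used in the Examples section with ecoli$req[1:4] and ecoli$req[[5]].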

Note

Most of the documentation was imported from the ACNUC help files written by Manolo Gouy.

Author(s)

J.R. Lobry & D. Charif

References

To get the release date and content of all the databases located at the PBIL, please look at the following URL: http://pbil.univ-lyon1.fr/search/releases.php
Gouy, M., Milleret, F., Mugnier, C., Jacobzone, M., Gautier, C. (1984) ACNUC: a nucleic acid sequence data base and analysis system. Nucl. Acids Res., 12:121-127.
Gouy, M., Gautier, C., Attimonelli, M., Lanave, C., Di Paola, G. (1985) ACNUC - a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput. Appl. Biosci., 1(3):167-172.
Gouy, M., Gautier, C., Milleret, F. (1985) System analysis and nucleic acid sequence banks. Biochimie, 67:433-436.

For an overview of seqinR's functionality, please consult this vignette: Charif, D., Lobry, J.R. (2005) SeqinR: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. Springer Verlag, Biological and Medical Physics/Biomedical Series, in preparation.

See Also

choosebank, getSequence, plot.SeqAcnucWeb

Examples

 ## Not run: s = choosebank("genbank")
 ## Not run: query(s$socket,"ecoli","sp=escherichia coli@")
 ## Not run: ecoli
 # To get the first 4 sequence names
 ## Not run: ecoli$req[1:4]
 ## Not run: ecoli$req[[5]]
 ## Not run: ecoli$call

[Package seqinr version 1.0-2 Index]