textmatrix {lsa} -- R Documentation
Description

Creates a document-term matrix from all text files in a given directory.
Usage

textmatrix( mydir, stemming=FALSE, language="german", minWordLength=2,
            minDocFreq=1, stopwords=NULL, vocabulary=NULL )

textvector( file, stemming=FALSE, language="german", minWordLength=2,
            minDocFreq=1, stopwords=NULL, vocabulary=NULL )
Arguments

file           filename (may include a path).
mydir          the directory path (e.g., "corpus/texts/").
stemming       boolean indicating whether to reduce all terms to their word stem.
language       specifies the language for stemming / stop-word removal.
minWordLength  words with fewer than minWordLength characters will be ignored.
minDocFreq     words appearing fewer than minDocFreq times within a document
               will be ignored.
stopwords      a stopword list containing terms that will be ignored.
vocabulary     if specified, only words in this term list will be used for
               building the matrix ('controlled vocabulary').
Details

All documents in the specified directory are read and a matrix is composed. Every cell of the matrix contains the exact number of appearances (i.e., the term frequency) of a word in a document. If specified, simple text preprocessing mechanisms are applied (stemming, stopword filtering, word-length cutoffs).

Stemming uses Porter's snowball stemmer (from package Rstem).
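As a minimal sketch of how these preprocessing options combine (assuming the lsa package and its stemming dependency are installed; the corpus content and file names below are illustrative):

```r
library(lsa)

# create a tiny illustrative corpus in a temporary directory
td <- tempfile()
dir.create(td)
write(c("dogs", "running", "runner"), file = file.path(td, "A1"))
write(c("dog", "runs", "cat"),        file = file.path(td, "A2"))

# stem all terms and ignore words shorter than 3 characters
tm <- textmatrix(td, stemming = TRUE, language = "english", minWordLength = 3)
tm

# clean up
unlink(td, recursive = TRUE)
```

With stemming enabled, inflected forms such as "dogs" and "dog" are mapped onto a common stem and counted in the same matrix row.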
Two stopword lists are included (one for English and one for German); they are loaded on demand into the variables stopwords_de and stopwords_en. They can be activated by calling data(stopwords_de) or data(stopwords_en). Attention: a stopword list must already be loaded when textmatrix() is called.
textvector() is a support function that creates a list of term-in-document occurrences for a single file.
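For illustration, textvector() might be applied to a single file like this (a sketch assuming the lsa package is installed; the file content is illustrative):

```r
library(lsa)

# write a single illustrative document to a temporary file
f <- tempfile()
write(c("dog", "dog", "cat"), file = f)

# list the term-in-document occurrences for that one file
tv <- textvector(f, minWordLength = 1)
tv

# clean up
unlink(f)
```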
Every generated matrix carries its own environment as an attribute; this environment holds the triples that are stored with setTriple() and can be retrieved with getTriple().
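A sketch of attaching metadata triples to a generated matrix (assuming the lsa package is installed and that the argument order is setTriple(matrix, subject, predicate, value), per the package's own triple functions; the predicate and value below are invented for illustration):

```r
library(lsa)

# build a small matrix
td <- tempfile()
dir.create(td)
write(c("dog", "cat"), file = file.path(td, "D1"))
tm <- textmatrix(td)

# attach a triple to the matrix's environment and read it back
# (assumed argument order: matrix, subject, predicate, value)
setTriple(tm, "D1", "has_category", "pets")
getTriple(tm, "D1", "has_category")

# clean up
unlink(td, recursive = TRUE)
```

Because the triples live in an environment attached to the matrix, setTriple() modifies the matrix's triple store in place; no reassignment is needed.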
Value

textmatrix   the document-term matrix (including row and column names).
Author(s)

Fridolin Wild fridolin.wild@wu-wien.ac.at
See Also

wordStem, stopwords_de, stopwords_en, setTriple, getTriple
Examples

# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )

# read them, create a document-term matrix
textmatrix(td)

# read them, drop german stopwords
data(stopwords_de)
textmatrix(td, stopwords=stopwords_de)

# read them based on a controlled vocabulary
voc = c("dog", "mouse")
textmatrix(td, vocabulary=voc, minWordLength=1)

# clean up
unlink(td, recursive=TRUE)