textcat_profile_db {textcat}    R Documentation
Create n-gram profile dbs for text categorization.
textcat_profile_db(x, id, ...)
x
    a character vector of text documents, or an R object of text
    documents extractable via as.character.
id
    a character vector giving the categories of the texts.
    Recycled to the length of x.
...
    further arguments specifying the options used for creating
    the n-gram profiles, see textcat_options for the
    (current) default options. The names of the arguments are partially
    matched against the names of the defaults, and used for the options
    instead in case of unique matches.
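As a minimal sketch of how these arguments fit together, the following builds a small profile db from four toy sentences. The sample texts, the category labels, and the option values n = 3 and size = 200 are illustrative assumptions rather than recommended settings, and the call assumes that the installed version of textcat accepts these options via the ... argument.

```r
library(textcat)

## Toy corpus, made up for illustration; ASCII only, so it is valid UTF-8.
x <- c("This is an English sentence about text categorization.",
       "Dies ist ein deutscher Satz zur Kategorisierung von Texten.",
       "Another short English text for the profile.",
       "Noch ein kurzer deutscher Text zum Profil.")

## Only two labels are given; they are recycled to the length of x,
## so texts 1 and 3 are "english" and texts 2 and 4 are "german".
id <- c("english", "german")

## Profile options (here n and size) are passed through '...'.
db <- textcat_profile_db(x, id, n = 3, size = 200)

## Inspect the categories for which a profile was built.
names(db)
```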
The text documents are split according to the given categories, and
n-gram profiles are computed via textcnt in package tau, with options
n, split and useBytes corresponding to the respective arguments, and
option reduce setting argument marker as needed. N-grams listed in
option ignore are removed, and only the most frequent remaining ones
retained, with the maximal number given by option size. The options
employed for building the db are stored in the db.
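The steps described above can be approximated by hand. This is a rough sketch, not the package's internal code: it calls textcnt from tau directly on a made-up sentence, and the variable names as well as the values ignore = "_" and size = 10 are assumptions chosen for illustration.

```r
library(tau)

txt    <- "The quick brown fox jumps over the lazy dog."
size   <- 10L   # assumed maximal number of n-grams to keep
ignore <- "_"   # assumed n-grams to drop (here the bare word marker)

## Count character n-grams of up to 3 characters, with "_" marking
## word boundaries; unclass() leaves a plain named vector of counts.
counts <- unclass(textcnt(txt, n = 3L, method = "ngram", marker = "_"))

## Remove the ignored n-grams, then keep the 'size' most frequent ones.
counts  <- counts[!(names(counts) %in% ignore)]
profile <- head(sort(counts, decreasing = TRUE), size)
profile
```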
There is a c method for combining profile dbs, provided that these
have identical options.
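A short sketch of combining two dbs with c, assuming both are built with the same options. The texts, category labels, and option values are made up for illustration.

```r
library(textcat)

en <- c("This is an English sentence.", "Another short English text.")
de <- c("Dies ist ein deutscher Satz.", "Noch ein kurzer deutscher Text.")

## Both dbs are built with the same options, so they can be combined.
## A single label is recycled to the length of each text vector.
db_en <- textcat_profile_db(en, "english", n = 3, size = 200)
db_de <- textcat_profile_db(de, "german",  n = 3, size = 200)

db <- c(db_en, db_de)
length(db)
```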
Unless the profile db uses bytes rather than characters (i.e., option
useBytes is TRUE), the text documents in x should be encoded in UTF-8.
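When the input comes from a source with a different encoding, it can be converted before the db is built. The sketch below assumes a Latin-1 input, simulated here with iconv, and uses base R's enc2utf8 for the conversion; the sample text and the category label are hypothetical.

```r
library(textcat)

## Simulate a Latin-1 encoded document (hypothetical input).
x <- iconv("Ein kurzer Text \u00fcber Sprache.",
           from = "UTF-8", to = "latin1")
Encoding(x)

## Convert to UTF-8 before building a character-based profile db.
x_utf8 <- enc2utf8(x)
Encoding(x_utf8)

db <- textcat_profile_db(x_utf8, "german")
```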