textcat {textcat} | R Documentation |
Categorize texts by finding the closest n-gram reference profile.
textcat(x, p = ECIMCI_profiles, method = "CT")
x | a character vector, or an object coercible to this using as.character. |
p | a textcat profile db (see textcat_profile_db). |
method | a character string specifying a built-in method, or a user-defined function for computing distances between n-gram profiles. See Details for available built-in methods. |
Currently, the following distance methods are available.
"CT"
"ranks"
"ALPD"
"KLI"
"KLJ"
"JS"
For the measures based on distances of frequency distributions, n-grams in the text and the reference profile are combined, and missing n-grams are given a small positive absolute frequency (currently, 1e-6).
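The combination and smoothing step for the frequency-based measures can be sketched in base R as follows. This is only an illustration under stated assumptions: the frequency vectors and the helper names fill_missing, kl and js are hypothetical, and the Jensen-Shannon formula shown is the textbook definition, which may differ in detail from what the package computes for "JS".

```r
# Hypothetical absolute n-gram frequencies for a text and a reference profile.
text_freq <- c("_t" = 3, "th" = 5, "he" = 4)
ref_freq  <- c("th" = 10, "he" = 8, "e_" = 6)

# Combine the n-grams of both profiles.
ngrams <- union(names(text_freq), names(ref_freq))

# Look up each n-gram; n-grams missing from a profile are given a small
# positive absolute frequency (1e-6, the value quoted in the text).
fill_missing <- function(freq, ngrams, eps = 1e-6) {
  out <- freq[ngrams]
  out[is.na(out)] <- eps
  names(out) <- ngrams
  out
}

p <- fill_missing(text_freq, ngrams)
q <- fill_missing(ref_freq, ngrams)

# Normalize to relative frequencies.
p <- p / sum(p)
q <- q / sum(q)

# Textbook Kullback-Leibler divergence, and the Jensen-Shannon
# divergence built from it (an assumption, not the package source).
kl <- function(a, b) sum(a * log(a / b))
js <- function(a, b) {
  m <- (a + b) / 2
  (kl(a, m) + kl(b, m)) / 2
}

js(p, q)
```

Without the small positive frequency, any n-gram present in only one profile would produce a zero in a denominator or inside a logarithm for the Kullback-Leibler based measures.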
For each given text, its n-gram profile is computed using the options in the reference profile db. Then, the distance between the profile and the reference profiles is computed, and the text is categorized into the category of the closest profile (if this is not unique, NA is obtained).
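The final selection step can be sketched in base R as follows. The distance matrix d, its values, and the helper closest_category are hypothetical and only illustrate the rule described above (closest profile wins, NA when the minimum is not unique); they are not taken from the package.

```r
# Hypothetical distances: one row per text, one column per reference profile.
d <- rbind(
  c(english = 120, german = 310, french = 295),
  c(english = 280, german = 140, french = 300),
  c(english = 200, german = 200, french = 350)  # two profiles tie here
)

# Name of the closest profile, or NA if the minimum is not unique.
closest_category <- function(dists) {
  best <- which(dists == min(dists))
  if (length(best) == 1L) names(dists)[best] else NA_character_
}

categories <- apply(d, 1, closest_category)
categories
```

In this sketch the third text is equally close to two profiles, so it is assigned NA rather than an arbitrary category.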
Unless the profile db uses bytes rather than characters, the texts in x should be encoded in UTF-8.
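If the input is in another encoding, it can be converted before categorization. A minimal sketch using base R, assuming a hypothetical input that is known to be Latin-1:

```r
# A hypothetical input stored in Latin-1 rather than UTF-8.
x <- "Stra\xdfe und Geb\xe4ude"
Encoding(x) <- "latin1"

# Convert to UTF-8 before passing the text on.
x_utf8 <- enc2utf8(x)

Encoding(x_utf8)
validUTF8(x_utf8)
```

When the source encoding is known but not declared on the strings, iconv(x, from = "latin1", to = "UTF-8") is an alternative to enc2utf8.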
W. B. Cavnar and J. M. Trenkle (1994), N-Gram-Based Text Categorization. In "Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval", 161–175.
textcat(c("This is an english sentence.", "Das ist ein deutscher satz."))