textcat {textcat}                                        R Documentation

N-Gram Based Text Categorization

Description

Categorize texts by finding the closest n-gram reference profile.

Usage

textcat(x, p = ECIMCI_profiles, method = "CT")

Arguments

x a character vector, or an object coercible to this using as.character.
p a textcat profile db (see textcat_profile_db).
method a character string specifying a built-in method, or a user-defined function for computing distances between n-gram profiles. See Details for the available built-in methods.

Details

Currently, the following distance methods are available.

"CT":
the out-of-place measure of Cavnar and Trenkle.
"ranks":
a variant of the Cavnar/Trenkle measure based on the aggregated absolute difference of the ranks of the combined n-grams in the given text and the reference profile.
"ALPD":
the sum of the absolute differences in n-gram log frequencies.
"KLI":
the Kullback-Leibler I-divergence I(p, q) = sum_i p_i log(p_i/q_i) of the n-gram frequency distributions p and q of the given text and the reference profile.
"KLJ":
the Kullback-Leibler J-divergence J(p, q) = sum_i (p_i - q_i) log(p_i/q_i), the symmetrized variant I(p, q) + I(q, p) of the I-divergences.
"JS":
the Jensen-Shannon divergence between the n-gram frequency distributions.
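
As a rough illustration of the rank-based "CT" measure, the following base R sketch sums, over the n-grams of a text profile, the absolute difference between each n-gram's rank in the text and its rank in the reference profile. The function name out_of_place, the example n-grams, and the penalty charged for n-grams absent from the reference (here the length of the reference profile) are assumptions made for this sketch; the package's own computation may differ in such details.

```r
# Illustrative sketch only; not the package's internal implementation.
# Both arguments are character vectors of n-grams, ordered from most to
# least frequent, so that position equals rank.
out_of_place <- function(text_grams, ref_grams,
                         penalty = length(ref_grams)) {  # assumed penalty
  ref_rank <- match(text_grams, ref_grams)   # NA where absent from reference
  d <- abs(seq_along(text_grams) - ref_rank)
  d[is.na(d)] <- penalty                     # missing n-grams get the penalty
  sum(d)
}

text_grams <- c("th", "he", "in", "er")      # hypothetical text profile
ref_grams  <- c("he", "th", "er", "an")      # hypothetical reference profile
out_of_place(text_grams, ref_grams)          # 1 + 1 + 4 + 1 = 7
```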

For the measures based on distances of frequency distributions, n-grams in the text and the reference profile are combined, and missing n-grams are given a small positive absolute frequency (currently, 1e-6).
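
The combination step and the divergence formulas above can be sketched in base R as follows. The helper names (align_freqs, kl_i, kl_j, js) and the example counts are hypothetical. Normalizing the padded counts to sum to one, and defining the Jensen-Shannon divergence as the average I-divergence to the midpoint distribution, are assumptions of this sketch rather than statements about the package internals.

```r
# Illustrative sketch only; not the package's internal implementation.
# Hypothetical absolute n-gram counts for a text and a reference profile.
text_freq <- c("_t" = 4, "th" = 3, "he" = 3, "e_" = 2)
ref_freq  <- c("_t" = 40, "th" = 35, "he" = 30, "an" = 20)

# Combine the n-grams of both profiles; n-grams missing from one side
# receive a small positive absolute frequency before normalization.
align_freqs <- function(x, y, eps = 1e-6) {
  grams <- union(names(x), names(y))
  fill <- function(v) {
    out <- setNames(rep(eps, length(grams)), grams)
    out[names(v)] <- v
    out / sum(out)                 # assumed: normalize to a distribution
  }
  list(p = fill(x), q = fill(y))
}

kl_i <- function(p, q) sum(p * log(p / q))          # I(p, q)
kl_j <- function(p, q) sum((p - q) * log(p / q))    # J(p, q) = I(p, q) + I(q, p)
js   <- function(p, q) {                            # assumed common definition
  m <- (p + q) / 2
  (kl_i(p, m) + kl_i(q, m)) / 2
}

d <- align_freqs(text_freq, ref_freq)
c(KLI = kl_i(d$p, d$q), KLJ = kl_j(d$p, d$q), JS = js(d$p, d$q))
```

Without the small positive frequency, any n-gram present on only one side would produce log(0) or a division by zero in the I-divergence, which is why the padding is needed for these measures.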

For each given text, its n-gram profile is computed using the options stored in the reference profile db. Then, the distances between this profile and each of the reference profiles are computed, and the text is assigned the category of the closest reference profile (if the closest profile is not unique, NA is returned).
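
The final selection step, including the NA result for ties, can be sketched as follows. The function name closest_category and the distance values are hypothetical; the sketch assumes the distances to all reference profiles are already available as a named numeric vector.

```r
# Illustrative sketch only; not the package's internal implementation.
# `d` is a named numeric vector of distances from one text to each
# reference profile; names are the category labels.
closest_category <- function(d) {
  winners <- names(d)[d == min(d)]
  if (length(winners) == 1L) winners else NA_character_
}

closest_category(c(english = 12, german = 30, french = 25))  # "english"
closest_category(c(english = 12, german = 12, french = 25))  # NA (tie)
```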

Unless the profile db uses bytes rather than characters, the texts in x should be encoded in UTF-8.
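
If the input is in another encoding, it can be converted before categorization. A minimal base R sketch, assuming a Latin-1 encoded input string (the example text is made up):

```r
# A string whose bytes are Latin-1 ("\xf6" is o-umlaut in Latin-1).
x <- "Das ist ein sch\xf6ner Satz."
Encoding(x) <- "latin1"        # declare the actual encoding of the bytes

x_utf8 <- enc2utf8(x)          # re-encode to UTF-8
Encoding(x_utf8)               # "UTF-8"
validUTF8(x_utf8)              # TRUE

# x_utf8 can now be passed as the x argument of textcat().
```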

References

W. B. Cavnar and J. M. Trenkle (1994), N-Gram-Based Text Categorization. In "Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval", 161–175.

Examples

textcat(c("This is an english sentence.",
          "Das ist ein deutscher satz."))

[Package textcat version 0.0-1 Index]