spam {ElemStatLearn}    R Documentation

Email Spam Data

Description

SPAM E-mail Database. See Details below.

Usage

data(spam)

Format

A data frame with 4601 observations on the following 58 variables.

A.1
a numeric vector
A.2
a numeric vector
A.3
a numeric vector
A.4
a numeric vector
A.5
a numeric vector
A.6
a numeric vector
A.7
a numeric vector
A.8
a numeric vector
A.9
a numeric vector
A.10
a numeric vector
A.11
a numeric vector
A.12
a numeric vector
A.13
a numeric vector
A.14
a numeric vector
A.15
a numeric vector
A.16
a numeric vector
A.17
a numeric vector
A.18
a numeric vector
A.19
a numeric vector
A.20
a numeric vector
A.21
a numeric vector
A.22
a numeric vector
A.23
a numeric vector
A.24
a numeric vector
A.25
a numeric vector
A.26
a numeric vector
A.27
a numeric vector
A.28
a numeric vector
A.29
a numeric vector
A.30
a numeric vector
A.31
a numeric vector
A.32
a numeric vector
A.33
a numeric vector
A.34
a numeric vector
A.35
a numeric vector
A.36
a numeric vector
A.37
a numeric vector
A.38
a numeric vector
A.39
a numeric vector
A.40
a numeric vector
A.41
a numeric vector
A.42
a numeric vector
A.43
a numeric vector
A.44
a numeric vector
A.45
a numeric vector
A.46
a numeric vector
A.47
a numeric vector
A.48
a numeric vector
A.49
a numeric vector
A.50
a numeric vector
A.51
a numeric vector
A.52
a numeric vector
A.53
a numeric vector
A.54
a numeric vector
A.55
a numeric vector
A.56
a numeric vector
A.57
a numeric vector
spam
Factor w/ 2 levels "email", "spam"

Details

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
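One way to "blind" the personalized indicators is simply to drop those columns before fitting a filter. The sketch below is illustrative only: the packaged data frame names its predictors A.1 to A.57, so the descriptive column names and the toy values here are hypothetical stand-ins, not the real data.

```r
# Toy data frame with made-up values; column names are hypothetical,
# chosen only to mirror the word_freq_WORD naming used in Details.
toy <- data.frame(
  word_freq_free   = c(1.2, 0.0, 3.4),
  word_freq_george = c(0.0, 2.5, 0.0),
  word_freq_650    = c(0.0, 1.1, 0.0),
  spam = factor(c("spam", "email", "spam"), levels = c("email", "spam"))
)

# Columns that identify this particular mailbox rather than spam in general.
personal <- c("word_freq_george", "word_freq_650")

# Keep every column that is not a personalized indicator.
blinded <- toy[, !(names(toy) %in% personal), drop = FALSE]
names(blinded)
```

With the real data, the same subsetting would need the positions of the corresponding A.* columns, which this help page does not list.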

For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998.

Attribute Information: The last column of 'spambase.data' denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0). Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
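The word_freq_WORD definition can be sketched in a few lines of R. This is a minimal illustration, not the code used to build the dataset: the function name `word_freq` is hypothetical, and matching case-insensitively is an assumption, since the description does not say how case was handled.

```r
# Percentage of words in `text` that equal `word`.
# Assumption: comparison ignores case (the source does not specify this).
word_freq <- function(text, word) {
  # A "word" is a run of alphanumeric characters, so split on everything else.
  tokens <- strsplit(text, "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop the empty token left by a leading separator
  if (length(tokens) == 0) return(0)
  100 * sum(tolower(tokens) == tolower(word)) / length(tokens)
}

email <- "Free money! Claim your free prize now"
word_freq(email, "free")  # 2 matches out of 7 words
```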

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail.
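The char_freq_CHAR formula can be illustrated the same way. The helper name `char_freq` is hypothetical, and counting whitespace toward the total number of characters is an assumption that the description leaves open.

```r
# Percentage of characters in `text` that equal the single character `ch`.
# Assumption: every character, including spaces, counts toward the total.
char_freq <- function(text, ch) {
  chars <- strsplit(text, "")[[1]]  # split into individual characters
  if (length(chars) == 0) return(0)
  100 * sum(chars == ch) / length(chars)
}

email <- "Win $$$ now!"
char_freq(email, "$")  # 3 matches out of 12 characters
```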

1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total = sum of the lengths of uninterrupted sequences of capital letters = total number of capital letters in the e-mail
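The three capital_run_length attributes can be derived from the same list of capital-letter runs. The sketch below assumes ASCII capitals A-Z only and returns 0 for text with no capitals; both choices are assumptions, and the function name `capital_run_stats` is hypothetical.

```r
# Average, longest, and total length of runs of consecutive capital letters.
# Assumptions: only ASCII A-Z count as capitals; text without capitals gives 0.
capital_run_stats <- function(text) {
  # perl = TRUE keeps the [A-Z] range independent of the locale's collation.
  runs <- regmatches(text, gregexpr("[A-Z]+", text, perl = TRUE))[[1]]
  lens <- nchar(runs)
  if (length(lens) == 0) return(c(average = 0, longest = 0, total = 0))
  c(average = mean(lens), longest = max(lens), total = sum(lens))
}

# Runs here are "BUY", "NOW", "L", and "OFFER".
capital_run_stats("BUY NOW!!! Limited OFFER")
```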

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0).
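The 0/1 class coding described here corresponds to the two-level factor listed under Format. As a small sketch, assuming 0 maps to "email" and 1 maps to "spam" (the vector `raw_class` is made-up example input):

```r
# Made-up class codes in the 0/1 convention of 'spambase.data'.
raw_class <- c(0, 1, 1, 0)

# Recode to a factor with the level names listed under Format.
spam_label <- factor(raw_class, levels = c(0, 1), labels = c("email", "spam"))
table(spam_label)
```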

Source

(a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt, Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304

(b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835

(c) Generated: June-July 1999

References

http://www.ics.uci.edu/~mlearn/MLRepository.html

Examples

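str(spam)   # compact summary of all 58 variables; str() returns NULL invisibly
head(spam)  # first six rows of the data frame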

[Package ElemStatLearn version 0.1-6 Index]