The dataset of interest comes from the direct marketing campaigns of a Portuguese banking institution. Each entry in the dataset is a single phone call to a client, and it records attributes of the bank client in question, the client's previous contact with the bank, and other social and economic context attributes. We are treating this as a classification problem: the response variable is whether the client subscribed to a term deposit during the phone call (yes or no).
data <- read.table("/Users/Jake/Documents/STAT 434/Final Project/bank-full.csv", header = TRUE, sep = ";")
head(data)
## age job marital education default balance housing loan contact
## 1 58 management married tertiary no 2143 yes no unknown
## 2 44 technician single secondary no 29 yes no unknown
## 3 33 entrepreneur married secondary no 2 yes yes unknown
## 4 47 blue-collar married unknown no 1506 yes no unknown
## 5 33 unknown single unknown no 1 no no unknown
## 6 35 management married tertiary no 231 yes no unknown
## day month duration campaign pdays previous poutcome y
## 1 5 may 261 1 -1 0 unknown no
## 2 5 may 151 1 -1 0 unknown no
## 3 5 may 76 1 -1 0 unknown no
## 4 5 may 92 1 -1 0 unknown no
## 5 5 may 198 1 -1 0 unknown no
## 6 5 may 139 1 -1 0 unknown no
require(ggplot2)
## Loading required package: ggplot2
# The per-variable exploratory histograms are commented out because they take up too much space
#for (val in names(data)) {
# print(ggplot(data = data, aes_string(x = val)) + geom_histogram(stat="count"))
#}
pr.out <- prcomp(data[c("age","balance","day","duration","campaign","pdays")], scale = TRUE)
pr.var <- pr.out$sdev^2
pve <- pr.var / sum(pr.var)
plot(pve, xlab = "Principal Component", ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
PCA does not look like a useful way to reduce the dimensionality of the data, because all of the principal components explain roughly the same proportion of the variance.
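For reference, the proportions of variance explained can also be printed numerically; this is a small sketch using the `pve` vector computed above:

# Individual and cumulative proportion of variance explained by each principal component
round(pve, 3)
round(cumsum(pve), 3)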
# We'll use 4 different sizes of training data: 10k, 20k, 30k, and 40k
set.seed(1)
n <- nrow(data)
indices10k <- sample(1:n, 10000, replace = FALSE)
train10k <- data[indices10k, ]
test10k <- data[-indices10k, ]
set.seed(1)
indices20k <- sample(1:n, 20000, replace = FALSE)
train20k <- data[indices20k, ]
test20k <- data[-indices20k, ]
set.seed(1)
indices30k <- sample(1:n, 30000, replace = FALSE)
train30k <- data[indices30k, ]
test30k <- data[-indices30k, ]
set.seed(1)
indices40k <- sample(1:n, 40000, replace = FALSE)
train40k <- data[indices40k, ]
test40k <- data[-indices40k, ]
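The four splits above are built by repeating nearly identical code; a more compact alternative, sketched below under the assumption that the same seed and sizes are wanted, stores the splits in a named list:

# Sketch: build all four train/test splits in one loop and keep them in a list
sizes <- c(10000, 20000, 30000, 40000)
splits <- lapply(sizes, function(sz) {
  set.seed(1)                                    # same seed as above for each size
  idx <- sample(1:nrow(data), sz, replace = FALSE)
  list(train = data[idx, ], test = data[-idx, ])
})
names(splits) <- paste0(sizes / 1000, "k")       # access e.g. splits[["10k"]]$train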
# 10k Training Size
glm.fits <- glm(y~., data = train10k, family = "binomial")
glm.pred <- ifelse(predict(glm.fits, test10k, type = "response") > 0.5, "yes", "no")
1 - mean(glm.pred == test10k$y)
## [1] 0.1000824
# 20k Training Size
glm.fits <- glm(y~., data = train20k, family = "binomial")
glm.pred <- ifelse(predict(glm.fits, test20k, type = "response") > 0.5, "yes", "no")
1 - mean(glm.pred == test20k$y)
## [1] 0.09892507
# 30k Training Size
glm.fits <- glm(y~., data = train30k, family = "binomial")
glm.pred <- ifelse(predict(glm.fits, test30k, type = "response") > 0.5, "yes", "no")
1 - mean(glm.pred == test30k$y)
## [1] 0.09716652
# 40k Training Size
glm.fits <- glm(y~., data = train40k, family = "binomial")
glm.pred <- ifelse(predict(glm.fits, test40k, type = "response") > 0.5, "yes", "no")
1 - mean(glm.pred == test40k$y)
## [1] 0.0994051
require(MASS)
## Loading required package: MASS
# 10k Training Size
lda.fit <- lda(y~., data = train10k)
lda.pred <- predict(lda.fit, test10k)$class
1 - mean(lda.pred == test10k$y)
## [1] 0.09883275
# 20k Training Size
lda.fit <- lda(y~., data = train20k)
lda.pred <- predict(lda.fit, test20k)$class
1 - mean(lda.pred == test20k$y)
## [1] 0.09654516
# 30k Training Size
lda.fit <- lda(y~., data = train30k)
lda.pred <- predict(lda.fit, test30k)$class
1 - mean(lda.pred == test30k$y)
## [1] 0.09644336
# 40k Training Size
lda.fit <- lda(y~., data = train40k)
lda.pred <- predict(lda.fit, test40k)$class
1 - mean(lda.pred == test40k$y)
## [1] 0.09710228
# 10k Training Size
qda.fit <- qda(y~., data = train10k)
qda.pred <- predict(qda.fit, test10k)$class
1 - mean(qda.pred == test10k$y)
## [1] 0.1294766
# 20k Training Size
qda.fit <- qda(y~., data = train20k)
qda.pred <- predict(qda.fit, test20k)$class
1 - mean(qda.pred == test20k$y)
## [1] 0.129507
# 30k Training Size
qda.fit <- qda(y~., data = train30k)
qda.pred <- predict(qda.fit, test30k)$class
1 - mean(qda.pred == test30k$y)
## [1] 0.1263559
# 40k Training Size
qda.fit <- qda(y~., data = train40k)
qda.pred <- predict(qda.fit, test40k)$class
1 - mean(qda.pred == test40k$y)
## [1] 0.1279985
require(class)
## Loading required package: class
cv.knn <- function(trainData, ks) {
  # 10-fold cross-validation error for KNN on the six numeric predictors,
  # evaluated for each candidate value of k in ks
  cvMSE <- rep(NA, length(ks))
  set.seed(1)
  splits <- sample(1:nrow(trainData), replace = FALSE)
  cv.fold <- 10
  fold.size <- nrow(trainData) / cv.fold
  for (ix in 1:length(ks)) {
    cvErrors <- rep(NA, cv.fold)
    for (cvrep in 1:cv.fold) {
      splitsix <- splits[(cvrep * fold.size - (fold.size - 1)):(cvrep * fold.size)]
      trainSub <- trainData[-splitsix, ]
      testSub <- trainData[splitsix, ]
      train.X <- cbind(trainSub$age, trainSub$balance, trainSub$day,
                       trainSub$duration, trainSub$campaign, trainSub$pdays)
      train.Y <- trainSub$y
      test.X <- cbind(testSub$age, testSub$balance, testSub$day,
                      testSub$duration, testSub$campaign, testSub$pdays)
      set.seed(1)
      pred <- knn(train.X, test.X, train.Y, k = ks[ix])
      cvErrors[cvrep] <- 1 - mean(testSub$y == pred)
    }
    print(ix)
    cvMSE[ix] <- mean(cvErrors)
  }
  return(cvMSE)
}
ks <- seq(4, 24, 2)
knn.10k <- cv.knn(train10k, ks)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
plot(knn.10k, type = "l")
knn.20k <- cv.knn(train20k, ks)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
plot(knn.20k, type = "l")
knn.30k <- cv.knn(train30k, ks)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
plot(knn.30k, type = "l")
knn.40k <- cv.knn(train40k, ks)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
plot(knn.40k, type = "l")
c((which(min(knn.10k) == knn.10k) * 2) + 2,min(knn.10k));
## [1] 12.0000 0.1169
c((which(min(knn.20k) == knn.20k) * 2) + 2,min(knn.20k));
## [1] 18.00000 0.11545
c((which(min(knn.30k) == knn.30k) * 2) + 2,min(knn.30k));
## [1] 24.0000 0.1136
c((which(min(knn.40k) == knn.40k) * 2) + 2,min(knn.40k));
## [1] 16.000000 0.112625
The lowest cross-validation error we get is for the 40k training set with k = 16.
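The index arithmetic above (multiplying by 2 and adding 2) depends on `ks` starting at 4 in steps of 2; a sketch of an equivalent, more direct lookup for the 40k results:

# Recover the best k and its CV error directly from the ks vector (first minimum if tied)
c(k = ks[which.min(knn.40k)], cv.error = min(knn.40k))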
# 10k Training Size
trainSub <- train10k
testSub <- test10k
train.X <- cbind(trainSub$age,trainSub$balance,trainSub$day,trainSub$duration,trainSub$campaign,trainSub$pdays)
train.Y <- trainSub$y
test.X <- cbind(testSub$age,testSub$balance,testSub$day,testSub$duration,testSub$campaign,testSub$pdays)
set.seed(1)
pred <- knn(train.X, test.X, train.Y, 16)
1 - mean((testSub$y == pred))
## [1] 0.1145097
# 20k Training Size
trainSub <- train20k
testSub <- test20k
train.X <- cbind(trainSub$age,trainSub$balance,trainSub$day,trainSub$duration,trainSub$campaign,trainSub$pdays)
train.Y <- trainSub$y
test.X <- cbind(testSub$age,testSub$balance,testSub$day,testSub$duration,testSub$campaign,testSub$pdays)
set.seed(1)
pred <- knn(train.X, test.X, train.Y, 16)
1 - mean((testSub$y == pred))
## [1] 0.1133235
# 30k Training Size
trainSub <- train30k
testSub <- test30k
train.X <- cbind(trainSub$age,trainSub$balance,trainSub$day,trainSub$duration,trainSub$campaign,trainSub$pdays)
train.Y <- trainSub$y
test.X <- cbind(testSub$age,testSub$balance,testSub$day,testSub$duration,testSub$campaign,testSub$pdays)
set.seed(1)
pred <- knn(train.X, test.X, train.Y, 16)
1 - mean((testSub$y == pred))
## [1] 0.1109066
# 40k Training Size
trainSub <- train40k
testSub <- test40k
train.X <- cbind(trainSub$age,trainSub$balance,trainSub$day,trainSub$duration,trainSub$campaign,trainSub$pdays)
train.Y <- trainSub$y
test.X <- cbind(testSub$age,testSub$balance,testSub$day,testSub$duration,testSub$campaign,testSub$pdays)
set.seed(1)
pred <- knn(train.X, test.X, train.Y, 16)
1 - mean((testSub$y == pred))
## [1] 0.1101516
library(tree)
# 10k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 10000)
data.train <- data[train,]
data.test <- data[-train,]
# Create initial tree
tree.data <- tree(factor(y)~., data.train)
# Analyze test error
preds <- predict(tree.data, newdata = data.test, type = "class")
conf.mat <- table(preds, data.test$y)
conf.mat
##
## preds no yes
## no 30437 3182
## yes 689 903
sum(preds != data.test$y)/length(preds)
## [1] 0.1099372
# Prune tree
cv.data <- cv.tree(tree.data, FUN = prune.misclass)
plot(cv.data$size, cv.data$dev, type = 'b')
# Analyze pruned results
prune.data <- prune.tree(tree.data, best = 5)
preds <- predict(prune.data, newdata = data.test, type = "class")
sum(preds != data.test$y)/length(preds)
## [1] 0.1099372
# 20k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 20000)
data.train <- data[train,]
data.test <- data[-train,]
# Create initial tree
tree.data <- tree(factor(y)~., data.train)
# Analyze test error
preds <- predict(tree.data, newdata = data.test, type = "class")
conf.mat <- table(preds, data.test$y)
conf.mat
##
## preds no yes
## no 21109 1473
## yes 1203 1426
sum(preds != data.test$y)/length(preds)
## [1] 0.1061441
# Prune tree
cv.data <- cv.tree(tree.data, FUN = prune.misclass)
plot(cv.data$size, cv.data$dev, type = 'b')
# Analyze pruned results
prune.data <- prune.tree(tree.data, best = 5)
preds <- predict(prune.data, newdata = data.test, type = "class")
sum(preds != data.test$y)/length(preds)
## [1] 0.1061441
# 30k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 30000)
data.train <- data[train,]
data.test <- data[-train,]
# Create initial tree
tree.data <- tree(factor(y)~., data.train)
# Analyze test error
preds <- predict(tree.data, newdata = data.test, type = "class")
conf.mat <- table(preds, data.test$y)
conf.mat
##
## preds no yes
## no 12774 946
## yes 695 796
sum(preds != data.test$y)/length(preds)
## [1] 0.1078825
# Prune tree
cv.data <- cv.tree(tree.data, FUN = prune.misclass)
plot(cv.data$size, cv.data$dev, type = 'b')
# Analyze pruned results
prune.data <- prune.tree(tree.data, best = 4)
preds <- predict(prune.data, newdata = data.test, type = "class")
sum(preds != data.test$y)/length(preds)
## [1] 0.1078825
# 40k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 40000)
data.train <- data[train,]
data.test <- data[-train,]
# Create and plot the initial tree, since this ends up being our best tree
tree.data <- tree(factor(y)~., data.train)
plot(tree.data)
text(tree.data, pretty = 0)
# Analyze test error
preds <- predict(tree.data, newdata = data.test, type = "class")
conf.mat <- table(preds, data.test$y)
conf.mat
##
## preds no yes
## no 4359 293
## yes 245 314
sum(preds != data.test$y)/length(preds)
## [1] 0.1032431
# Prune tree
cv.data <- cv.tree(tree.data, FUN = prune.misclass)
plot(cv.data$size, cv.data$dev, type = 'b')
# Analyze pruned results
prune.data <- prune.tree(tree.data, best = 3)
preds <- predict(prune.data, newdata = data.test, type = "class")
sum(preds != data.test$y)/length(preds)
## [1] 0.1032431
plot(prune.data)
text(prune.data, pretty = 0)
# 10k Training Size
# Recode the response to 0/1; gbm with distribution = "bernoulli" expects a numeric 0/1 response
data$y <- ifelse(data$y == "yes", 1, 0)
set.seed(1)
train <- sample(1:nrow(data), size = 10000)
data.train <- data[train,]
data.test <- data[-train,]
library(gbm)
## Loading required package: survival
## Loading required package: lattice
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
lambda.seq <- c(.0001, .001, seq(from = .1, to = .5, by = .05))
train.errs <- numeric(length(lambda.seq))
for (i in 1:length(lambda.seq)) {
boost.data <- gbm(y~., data = data.train, distribution = "bernoulli",
n.trees = 1000, shrinkage = lambda.seq[i])
yhat.boost <- predict(boost.data, newdata = data.train, n.trees = 1000, type = "response")
train.errs[i] <- mean((yhat.boost - data.train$y))
}
plot(lambda.seq, train.errs, type = "b")
## Test error (mean difference between predicted probabilities and the 0/1 response, not a misclassification rate)
boost.data <- gbm(y~., data = data.train, distribution = "bernoulli",
n.trees = 1000, shrinkage = .5)
yhat.boost <- predict(boost.data, newdata = data.test, n.trees = 1000, type = "response")
mean((yhat.boost - data.test$y))
## [1] 0.006252458
# 20k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 20000)
data.train <- data[train,]
data.test <- data[-train,]
library(gbm)
lambda.seq <- c(.0001, .001, seq(from = .1, to = .5, by = .05))
train.errs <- numeric(length(lambda.seq))
for (i in 1:length(lambda.seq)) {
boost.data <- gbm(y~., data = data.train, distribution = "bernoulli",
n.trees = 1000, shrinkage = lambda.seq[i])
yhat.boost <- predict(boost.data, newdata = data.train, n.trees = 1000, type = "response")
train.errs[i] <- mean((yhat.boost - data.train$y))
}
plot(lambda.seq, train.errs, type = "b")
## Test error (mean difference between predicted probabilities and the 0/1 response, not a misclassification rate)
boost.data <- gbm(y~., data = data.train, distribution = "bernoulli",
n.trees = 1000, shrinkage = .5)
yhat.boost <- predict(boost.data, newdata = data.test, n.trees = 1000, type = "response")
mean((yhat.boost - data.test$y))
## [1] 0.003690387
# 30k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 30000)
data.train <- data[train,]
data.test <- data[-train,]
library(gbm)
lambda.seq <- c(.0001, .001, seq(from = .1, to = .5, by = .05))
train.errs <- numeric(length(lambda.seq))
for (i in 1:length(lambda.seq)) {
boost.data <- gbm(y~., data = data.train, distribution = "bernoulli",
n.trees = 1000, shrinkage = lambda.seq[i])
yhat.boost <- predict(boost.data, newdata = data.train, n.trees = 1000, type = "response")
train.errs[i] <- mean((yhat.boost - data.train$y))
}
plot(lambda.seq, train.errs, type = "b")
## Test error (mean difference between predicted probabilities and the 0/1 response, not a misclassification rate)
boost.data <- gbm(y~., data = data.train, distribution = "bernoulli",
n.trees = 1000, shrinkage = .5)
yhat.boost <- predict(boost.data, newdata = data.test, n.trees = 1000, type = "response")
mean((yhat.boost - data.test$y))
## [1] 0.003420768
# 40k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 40000)
data.train <- data[train,]
data.test <- data[-train,]
library(gbm)
lambda.seq <- c(.0001, .001, seq(from = .1, to = .5, by = .05))
train.errs <- numeric(length(lambda.seq))
for (i in 1:length(lambda.seq)) {
boost.data <- gbm(y~., data = data.train, distribution = "bernoulli",
n.trees = 1000, shrinkage = lambda.seq[i])
yhat.boost <- predict(boost.data, newdata = data.train, n.trees = 1000, type = "response")
train.errs[i] <- mean((yhat.boost - data.train$y))
}
plot(lambda.seq, train.errs, type = "b")
## Test error (mean difference between predicted probabilities and the 0/1 response, not a misclassification rate)
boost.data <- gbm(y~., data = data.train, distribution = "bernoulli",
n.trees = 1000, shrinkage = .5)
yhat.boost <- predict(boost.data, newdata = data.test, n.trees = 1000, type = "response")
mean((yhat.boost - data.test$y))
## [1] 0.001897901
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# 10k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 10000)
data.train <- data[train,]
data.test <- data[-train,]
data.rf <- randomForest(factor(y)~., data.train,
importance = TRUE)
## Test error
yhat.rf <- predict(data.rf, newdata = data.test)
mean((yhat.rf != data.test$y))
## [1] 0.09352191
# 20k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 20000)
data.train <- data[train,]
data.test <- data[-train,]
data.rf <- randomForest(factor(y)~., data.train,
importance = TRUE)
## Test error
yhat.rf <- predict(data.rf, newdata = data.test)
mean((yhat.rf != data.test$y))
## [1] 0.09043671
# 30k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 30000)
data.train <- data[train,]
data.test <- data[-train,]
data.rf <- randomForest(factor(y)~., data.train,
importance = TRUE)
## Test error
yhat.rf <- predict(data.rf, newdata = data.test)
mean((yhat.rf != data.test$y))
## [1] 0.08986917
# 40k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 40000)
data.train <- data[train,]
data.test <- data[-train,]
data.rf <- randomForest(factor(y)~., data.train,
importance = TRUE)
## Test error
yhat.rf <- predict(data.rf, newdata = data.test)
mean((yhat.rf != data.test$y))
## [1] 0.0880829
# 10k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 10000)
data.train <- data[train,]
data.test <- data[-train,]
bag.data <- randomForest(factor(y)~., data = data.train, mtry = 16, importance = TRUE)
## Test error
yhat.bag <- predict(bag.data, newdata = data.test)
mean(yhat.bag != data.test$y)
## [1] 0.09627673
# 20k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 20000)
data.train <- data[train,]
data.test <- data[-train,]
bag.data <- randomForest(factor(y)~., data = data.train, mtry = 16, importance = TRUE)
## Test error
yhat.bag <- predict(bag.data, newdata = data.test)
mean(yhat.bag != data.test$y)
## [1] 0.09162667
# 30k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 30000)
data.train <- data[train,]
data.test <- data[-train,]
bag.data <- randomForest(factor(y)~., data = data.train, mtry = 16, importance = TRUE)
## Test error
yhat.bag <- predict(bag.data, newdata = data.test)
mean(yhat.bag != data.test$y)
## [1] 0.0921044
# 40k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 40000)
data.train <- data[train,]
data.test <- data[-train,]
bag.data <- randomForest(factor(y)~., data = data.train, mtry = 16, importance = TRUE)
## Test error
yhat.bag <- predict(bag.data, newdata = data.test)
mean(yhat.bag != data.test$y)
## [1] 0.09000192
library(nnet)
# 10k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 10000)
data.train <- data[train,]
data.test <- data[-train,]
nnet.data <- nnet(factor(y)~., data = data, subset = train, linout = FALSE, size=20)
## # weights: 881
## initial value 7302.123925
## iter 10 value 3504.903033
## iter 20 value 3132.736507
## iter 30 value 3068.240038
## iter 40 value 2968.288583
## iter 50 value 2936.301229
## iter 60 value 2887.779655
## iter 70 value 2882.627130
## iter 80 value 2860.986507
## iter 90 value 2835.528970
## iter 100 value 2825.902203
## final value 2825.902203
## stopped after 100 iterations
## Test error
yhat.nnet <- predict(nnet.data, newdata = data.test, type = "class")
mean(yhat.nnet != data.test$y)
## [1] 0.1157309
# 20k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 20000)
data.train <- data[train,]
data.test <- data[-train,]
nnet.data <- nnet(factor(y)~., data = data, subset = train, linout = FALSE, size=20)
## # weights: 881
## initial value 8251.537236
## iter 10 value 6977.562414
## iter 20 value 6151.321390
## iter 30 value 6010.493291
## iter 40 value 5947.463387
## iter 50 value 5848.033993
## iter 60 value 5736.437559
## iter 70 value 5701.432881
## iter 80 value 5666.540647
## iter 90 value 5625.813420
## iter 100 value 5598.573592
## final value 5598.573592
## stopped after 100 iterations
## Test error
yhat.nnet <- predict(nnet.data, newdata = data.test, type = "class")
mean(yhat.nnet != data.test$y)
## [1] 0.1150688
# 30k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 30000)
data.train <- data[train,]
data.test <- data[-train,]
nnet.data <- nnet(factor(y)~., data = data, subset = train, linout = FALSE, size=20)
## # weights: 881
## initial value 44160.413282
## iter 10 value 10315.407305
## iter 20 value 9640.055215
## iter 30 value 9237.976556
## iter 40 value 9160.481917
## iter 50 value 9100.269366
## iter 60 value 9077.683004
## iter 70 value 9003.228770
## iter 80 value 8996.017418
## iter 90 value 8976.725883
## iter 100 value 8947.892690
## final value 8947.892690
## stopped after 100 iterations
## Test error
yhat.nnet <- predict(nnet.data, newdata = data.test, type = "class")
mean(yhat.nnet != data.test$y)
## [1] 0.1135363
# 40k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 40000)
data.train <- data[train,]
data.test <- data[-train,]
nnet.data <- nnet(factor(y)~., data = data, subset = train, linout = FALSE, size=20)
## # weights: 881
## initial value 85426.851429
## iter 10 value 14224.804688
## iter 20 value 12847.641623
## iter 30 value 12567.176545
## iter 40 value 12180.951130
## iter 50 value 11820.441472
## iter 60 value 11701.455412
## iter 70 value 11621.342632
## iter 80 value 11544.775386
## iter 90 value 11439.165800
## iter 100 value 11390.255465
## final value 11390.255465
## stopped after 100 iterations
## Test error
yhat.nnet <- predict(nnet.data, newdata = data.test, type = "class")
mean(yhat.nnet != data.test$y)
## [1] 0.1162925
library(e1071)
# 10k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 10000)
data.train <- data[train,]
data.test <- data[-train,]
tune.out <- tune(svm, factor(y)~., data = data.train, kernel = "polynomial",
ranges = list(cost = c(.01, .1, .5, 1, 5, 10)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.1128
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.1205 0.009478045
## 2 0.10 0.1202 0.009953224
## 3 0.50 0.1192 0.009874771
## 4 1.00 0.1186 0.010068653
## 5 5.00 0.1150 0.010561986
## 6 10.00 0.1128 0.010075272
svcfit <- svm(factor(y)~., data = data.train, kernel = "polynomial",
cost = 10)
## Training error (predictions are made on the training set here)
train.preds <- predict(svcfit, data.train)
mean(train.preds != data.train$y)
## [1] 0.1023
# 20k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 20000)
data.train <- data[train,]
data.test <- data[-train,]
tune.out <- tune(svm, factor(y)~., data = data.train, kernel = "polynomial",
ranges = list(cost = c(.01, .1, .5, 1, 5, 10)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.1092
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.11960 0.006732178
## 2 0.10 0.11975 0.006558497
## 3 0.50 0.11890 0.006694940
## 4 1.00 0.11775 0.006925196
## 5 5.00 0.11245 0.006152913
## 6 10.00 0.10920 0.005769652
svcfit <- svm(factor(y)~., data = data.train, kernel = "polynomial",
cost = .5)
## Training error (predictions are made on the training set here)
train.preds <- predict(svcfit, data.train)
mean(train.preds != data.train$y)
## [1] 0.11745
# 30k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 30000)
data.train <- data[train,]
data.test <- data[-train,]
tune.out <- tune(svm, factor(y)~., data = data.train, kernel = "polynomial",
ranges = list(cost = c(.01, .1, .5, 1, 5, 10)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.1053667
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.1183000 0.005339002
## 2 0.10 0.1183000 0.005219739
## 3 0.50 0.1168333 0.005340851
## 4 1.00 0.1152000 0.005427866
## 5 5.00 0.1091667 0.005268706
## 6 10.00 0.1053667 0.004905150
svcfit <- svm(factor(y)~., data = data.train, kernel = "polynomial",
cost = .5)
## Training error (predictions are made on the training set here)
train.preds <- predict(svcfit, data.train)
mean(train.preds != data.train$y)
## [1] 0.1158333
# 40k Training Size
set.seed(1)
train <- sample(1:nrow(data), size = 40000)
data.train <- data[train,]
data.test <- data[-train,]
tune.out <- tune(svm, factor(y)~., data = data.train, kernel = "polynomial",
ranges = list(cost = c(.01, .1, .5, 1, 5, 10)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.102575
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.117175 0.004784306
## 2 0.10 0.117175 0.004663764
## 3 0.50 0.115300 0.003902990
## 4 1.00 0.112700 0.004212745
## 5 5.00 0.105850 0.004062361
## 6 10.00 0.102575 0.004190018
svcfit <- svm(factor(y)~., data = data.train, kernel = "polynomial",
cost = .5)
## Training error (predictions are made on the training set here)
train.preds <- predict(svcfit, data.train)
mean(train.preds != data.train$y)
## [1] 0.113675
We found that the boosting models appeared to perform far better than all of the other classification methods: their reported test errors for the various training sizes were all well below 1%, while the other models all had test misclassification rates above 9%. One caveat is that the boosting errors reported above are the mean difference between the predicted probabilities and the 0/1 response rather than misclassification rates, so they are not directly comparable to the other models' error rates. The best boosting model used a training set of 40 thousand observations, with a reported test error of about 0.19% (0.001897901). For all of the models, increasing the training size reduced test error slightly, but only by a few tenths of a percentage point.
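To put boosting on the same scale as the other methods, the predicted probabilities could be thresholded at 0.5 and scored as a misclassification rate. A minimal sketch, assuming the 40k boosting fit `boost.data` and its corresponding `data.test` split are recreated (this was not run here):

# Sketch: threshold the boosted probabilities at 0.5 and compute a test misclassification rate
prob.boost <- predict(boost.data, newdata = data.test, n.trees = 1000, type = "response")
class.boost <- ifelse(prob.boost > 0.5, 1, 0)
mean(class.boost != data.test$y)  # comparable to the misclassification rates reported for the other models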
It is interesting that 88.3% of the phone calls in the dataset did not result in the client subscribing to a term deposit. Thus, by simply predicting that every phone call will not result in a subscription, we would obtain an 11.7% misclassification rate on this dataset, which is not much worse than many of our models' test errors. That said, our models with 10-11% misclassification rates shouldn't necessarily be thrown away: correctly predicting more yeses, even at the cost of misclassifying some noes, may be more valuable in this setting than missing every yes.
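For reference, the majority-class baseline mentioned above can be computed in one line; recall that `y` was recoded to 0/1 before the boosting section, so its mean is the proportion of calls that ended in a subscription:

# Error rate of always predicting "no": the fraction of calls that actually resulted in a subscription
mean(data$y)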
From our exploratory analysis (a brief sketch of how such rates can be tabulated follows this list), we found that:
1. The longer the duration of the call, the more likely a person was to open an account.
2. Calls in September, October, December, or March were much more likely to result in a person opening an account than calls in other months.
3. People over 60 are much more likely to open an account than people under 60.
4. If the outcome of the previous marketing campaign was good, the chance of the customer opening an account was higher.
5. The dimensionality of the data can't be reduced without throwing away important information.
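As a rough check on these findings, the subscription rate can be tabulated against individual predictors; this is a sketch using month, an over-60 age indicator, and the previous campaign outcome (again relying on the 0/1 coding of `y`):

# Subscription rate by month of last contact
sort(tapply(data$y, data$month, mean), decreasing = TRUE)
# Subscription rate for clients over 60 versus 60 and under
tapply(data$y, data$age > 60, mean)
# Subscription rate by outcome of the previous campaign
tapply(data$y, data$poutcome, mean)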