Machine Learning using R : Crash course in Classification Methods

This submission has been added to the schedule

Submitted Jun 1, 2014

Section: Workshops
Technical level: Beginner

The aim is to give attendees an implementation-oriented overview of some of the major classification methods in R. The focus of the workshop will be on breadth rather than depth: many methods will be introduced, but their mathematical properties will not be discussed in detail.

As a caveat, most real-life problems cannot be solved efficiently without a more detailed understanding of these algorithms. This workshop should nevertheless give a quick-and-dirty start to solving such problems.

Target Audience: Beginner/Intermediate

Outline

The following topics will be covered. The format will be a bit of theory followed by implementation in R.

Introduction to Machine learning

Types of Learning (Supervised/Unsupervised/Reinforcement)
Introduction to Generalization
Train/Test/Validation Datasets
Bias-variance tradeoff
Overfitting
Cross-validation
Regularization
Grid Search
Hyperparameter Optimization
Feature Selection/Transformation
a. Greedy feature selection (forward, backward, stepwise)
b. Non-linear transformations, Kernels
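
As a rough illustration of the train/test split and cross-validation topics listed above, here is a minimal sketch in base R. The choice of the built-in iris data, the 70/30 split, the five folds, and logistic regression as the model are all assumptions made for this example; they are not taken from the workshop material.

```r
set.seed(42)  # for a reproducible split

# Two-class subset of the built-in iris data (versicolor vs. virginica)
d <- droplevels(subset(iris, Species != "setosa"))
d$y <- as.integer(d$Species == "virginica")
d$Species <- NULL

# Hold out 30% of the rows as a test set
test_idx <- sample(nrow(d), size = round(0.3 * nrow(d)))
train <- d[-test_idx, ]
test  <- d[test_idx, ]

# 5-fold cross-validation, using the training rows only
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_acc <- sapply(seq_len(k), function(i) {
  fit  <- glm(y ~ ., data = train[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = train[fold == i, ], type = "response")
  mean((prob > 0.5) == (train$y[fold == i] == 1))
})

# Final model fit on all training rows, scored once on the held-out test set
final <- glm(y ~ ., data = train, family = binomial)
test_prob <- predict(final, newdata = test, type = "response")
test_acc <- mean((test_prob > 0.5) == (test$y == 1))

print(round(cv_acc, 2))   # per-fold accuracy
print(round(test_acc, 2)) # accuracy on the held-out rows
```

The test set is touched only once, at the end; model choices are made from the cross-validation scores alone.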

Classification Techniques covered:

Linear Regression
Logistic Regression
LASSO, Ridge and Elastic net regression
kNN
Discriminant Analysis
Decision Trees, CART, CHAID
Support Vector Machines
Naïve Bayes
Ensemble Methods
a. Boosting
b. Bagging
c. Random Forest
d. Regularized Random Forest
e. Gradient Boosting Machines
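
To make one of the techniques above concrete, the following is a small from-scratch sketch of kNN in base R. The function name `knn_predict`, the iris data, the split size, and k = 5 are hypothetical choices for this illustration; in practice a packaged implementation such as `class::knn` or the caret wrappers would normally be used instead.

```r
# Hypothetical helper: classify each row of test_x by majority vote
# among its k nearest training rows (Euclidean distance)
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(x_new) {
    # distance from x_new to every training row
    dists <- sqrt(colSums((t(train_x) - x_new)^2))
    nearest <- order(dists)[seq_len(k)]
    # majority vote; ties go to the first level in table order
    names(which.max(table(train_y[nearest])))
  })
}

set.seed(1)
idx  <- sample(nrow(iris), 100)  # 100 training rows, 50 test rows
pred <- knn_predict(iris[idx, 1:4], iris$Species[idx], iris[-idx, 1:4], k = 5)

actual   <- as.character(iris$Species[-idx])
conf_mat <- table(predicted = pred, actual = actual)
accuracy <- mean(pred == actual)

print(conf_mat)
print(accuracy)
```

The features are left unscaled here for brevity; since kNN is distance-based, scaling usually matters on real data.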

Unsupervised learning techniques covered:

Dimensionality Reduction: Principal Component Analysis
K-Means clustering
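
Both unsupervised techniques above are available in base R through `prcomp` and `kmeans`. The sketch below assumes the built-in iris measurements and three clusters purely for illustration.

```r
# Standardize the four numeric columns so no feature dominates
x <- scale(iris[, 1:4])

# Principal component analysis
pca <- prcomp(x)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))  # share of variance per component

# K-means with 3 clusters; nstart reruns from several random starts
set.seed(7)
km <- kmeans(x, centers = 3, nstart = 20)

# Compare the clusters with the known species labels (not used in fitting)
print(table(cluster = km$cluster, species = iris$Species))
```

The cluster numbers are arbitrary labels, so the table should be read by row pattern rather than by matching numbers to species.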

Illustrating common pitfalls

Data snooping
Occam’s Razor
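
One common form of data snooping is letting test rows influence preprocessing. The sketch below shows one way to avoid it when standardizing features: the means and standard deviations are computed on the training rows only and then reused for the test rows. The data set and split are assumptions for this example.

```r
set.seed(3)
idx     <- sample(nrow(iris), 100)
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]

# Statistics come from the training rows only
mu  <- colMeans(train_x)
sdv <- apply(train_x, 2, sd)

# The same training statistics are applied to both sets
train_s <- scale(train_x, center = mu, scale = sdv)
test_s  <- scale(test_x,  center = mu, scale = sdv)

# Training columns are centered at zero; test columns generally are not
print(round(colMeans(train_s), 3))
print(round(colMeans(test_s), 3))
```

Calling `scale()` on the full data before splitting would instead leak information about the test rows into the training features.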

Big Data Analytics (time permitting; AWS credit is needed for the implementation)

Introduction to Big Data and Hadoop
R and Big Data
a. Hadoop
b. Linear Model
c. Random Forest

Requirements

Prerequisites:

The attendee should have an aptitude for solving data mining/machine learning problems.
It is preferred that attendees read a bit about R before coming (please see the links below).

Hardware:
Any modern laptop configuration will work. It is good to have at least 4 GB of RAM and a dual-core or quad-core processor.

Software:

Install the latest version of R from the CRAN website: http://cran.r-project.org/
Install RStudio: https://www.rstudio.com/
AWS credit is needed for the Hadoop part.

Dataset and required R packages
Please download data from the following location:
https://dl.dropboxusercontent.com/u/72650512/5th_el_train.csv.zip

Please install the following R packages:
(To do this, open RStudio and enter install.packages("package_name") for each package.)

caret
data.table
e1071
foba
gbm
glmnet
mboost
nnet
randomForest
RRF

Update (22 July):
Additional packages (please install, if possible):
sqldf
ROCR
kernlab
rpart
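
Rather than installing each package by hand, the packages named in the lists above can be installed in one go. This is only a sketch: the variable names are made up for the example, and the `interactive()` guard is an assumption added so that nothing is downloaded when the script runs unattended.

```r
# Packages named in the lists above
pkgs <- c("caret", "data.table", "e1071", "foba", "gbm", "glmnet", "mboost",
          "nnet", "randomForest", "RRF", "sqldf", "ROCR", "kernlab", "rpart")

# Keep only the ones that are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install them from the console; skipped in non-interactive runs
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

print(missing_pkgs)  # packages that still needed installing
```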

Speaker bio

Data Analytics professional at Cisco Systems India Pvt Ltd.

Links

Code used for the workshop: https://dl.dropboxusercontent.com/u/72650512/5el Workshop.R
Presentation used for the workshop: https://dl.dropboxusercontent.com/u/72650512/workshop_presentation.pdf
Also available at: https://github.com/rouseguy/workshop
Readings/Videos:
1. Andrew Ng’s Coursera course: https://class.coursera.org/ml-005/
2. Yaser Abu-Mostafa’s Caltech course: http://work.caltech.edu/telecourse.html
3. An Introduction to Statistical Learning with Applications in R (James, Witten, Hastie and Tibshirani): http://www-bcf.usc.edu/~gareth/ISL/
4. R programming: https://class.coursera.org/rprog-004
5. Intro to R book: http://cran.r-project.org/doc/manuals/R-intro.pdf

The Fifth Elephant 2014