Introduction to Statistics and Basics of Mathematics for Data Science - the hacker's way
Submitted by Bargava Subramanian (@barsubra) on Tuesday, 7 June 2016
Many of us decided in high school that math was our calling and went on to study highly quantitative fields such as engineering and computer science; some of us specialized further with a Master’s degree, including an MBA. And yet here we are, a few years into our careers, suddenly realizing that our math basics aren’t as strong as we thought they were.
Numerical literacy, including basic proficiency in math and stats, is a must for anyone pursuing a career in data science.
The goal of this workshop is to introduce some key concepts that are used repeatedly in data science applications. Our approach is what we call the “Hacker’s way”. Instead of going back to formulae and proofs, we teach the concepts by writing code and applying them to practical problems. Concepts don’t stick if their usage is never taught.
The focus will be on depth rather than breadth. Three areas have been chosen and will be covered in sufficient depth: 50% of the time will be spent on the concepts and 50% on coding them.
Target Audience for the workshop:
Our ideal attendee will be in one of two categories:
a) Someone in IT with some background in programming who wants to pick up the math needed for data science and get a flavor of different data science problems
b) Someone who is a beginner in data science, or has been doing data analysis using MS Excel, and wants to pick up the skills to take the next step in their data science career
Programming knowledge is mandatory. Attendees should, at the bare minimum, be able to write conditional statements, use loops, write functions comfortably, understand code snippets and come up with programming logic.
This is a full-day workshop. The 6-hour workshop is roughly split into 3 major modules. Each module introduces some math, followed by an application that uses the concepts learnt.
Workshop Topics and Structure
Module 1: Basics of Statistics (Application: A/B Testing)
The first part of this module will introduce the basic concepts (mean, median, standard deviation, variance, probability distribution). Then, using A/B testing as the application, hypothesis testing is introduced. At the end of this module, attendees will understand confidence intervals, significance levels, p-values and the t-test.
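As a rough illustration of the kind of code this module leads to, here is a minimal sketch using only the Python standard library. The data, the variable names (`variant_a`, `variant_b`) and the helper functions are hypothetical and not taken from the workshop material. It computes the descriptive statistics listed above and Welch's t statistic for two A/B groups. Because the standard library has no t distribution, the p-value is estimated with a permutation test instead of a t table; that is a swapped-in technique, not necessarily the one taught in the workshop.

```python
import math
import random
import statistics

# Hypothetical A/B test data: seconds on page for visitors shown each variant.
variant_a = [12.1, 14.3, 11.8, 13.5, 12.9, 15.2, 13.1, 12.4]
variant_b = [14.8, 15.9, 13.7, 16.2, 15.1, 14.4, 16.8, 15.5]


def summarize(sample):
    """Descriptive statistics for one sample."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "variance": statistics.variance(sample),  # sample variance (divides by n - 1)
        "stdev": statistics.stdev(sample),
    }


def welch_t(a, b):
    """Welch's t statistic for mean(b) - mean(a) and its approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(b) - statistics.mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def permutation_p_value(a, b, n_iter=5000, seed=0):
    """Two-sided p-value: share of random relabelings whose gap in means
    is at least as large as the observed gap."""
    rng = random.Random(seed)
    observed = abs(sum(b) / len(b) - sum(a) / len(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(sum(right) / len(right) - sum(left) / len(left)) >= observed:
            hits += 1
    return hits / n_iter


t_stat, dof = welch_t(variant_a, variant_b)
p_value = permutation_p_value(variant_a, variant_b)
print(summarize(variant_a))
print(summarize(variant_b))
print(f"t = {t_stat:.2f}, df = {dof:.1f}, permutation p = {p_value:.4f}")
```

With a significance level of 0.05, a p-value below 0.05 would lead us to reject the hypothesis that both variants have the same mean. In practice a library routine such as SciPy's `scipy.stats.ttest_ind` with `equal_var=False` would report the t-based p-value directly.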
Module 2: Basics of Linear Algebra (Application: Supervised Machine Learning: Linear Regression)
The first part of this module will introduce attendees to the world of linear algebra (vectors, matrices and operations on them). Linear regression, one of the simplest and most powerful supervised machine learning algorithms, is then introduced through an application in which attendees are taught how to build a model that predicts a continuous target variable. The various diagnostics in the linear model’s output are discussed.
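To make the link between matrix operations and linear regression concrete, here is a minimal sketch that fits a line by solving the normal equations (X^T X) beta = X^T y with hand-written matrix helpers. The data and all function names are hypothetical and chosen for illustration; real work would more likely use NumPy or a modeling library. R-squared is included as one example of a model diagnostic.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Matrix product of a (n x k) and b (k x m)."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.
    Assumes a is square and non-singular."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear_regression(xs, ys):
    """Return [intercept, slope] from the normal equations."""
    X = [[1.0, x] for x in xs]  # design matrix with an intercept column
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(row, ys)) for row in Xt]
    return solve(XtX, Xty)


def r_squared(xs, ys, beta):
    """Share of the variance in ys explained by the fitted line."""
    intercept, slope = beta
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: one feature and a continuous target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta = fit_linear_regression(xs, ys)
print(f"intercept = {beta[0]:.3f}, slope = {beta[1]:.3f}")
print(f"R^2 = {r_squared(xs, ys, beta):.4f}")
```

The same `fit_linear_regression` idea extends to several features by adding columns to the design matrix, since `solve` works for any square non-singular system.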
Module 3: Basics of Linear Algebra, continued (Application: Unsupervised Machine Learning: Dimensionality Reduction)
In the first part of this module, eigenvalues and eigenvectors are introduced. Then an unsupervised machine learning algorithm, Principal Component Analysis, is introduced and an application of dimensionality reduction is implemented.
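The connection between eigenvectors and Principal Component Analysis can be sketched in a few lines. The example below is an illustration under stated assumptions, not the workshop's implementation: the dataset and function names are made up for this sketch, and the leading eigenvector of the covariance matrix is found with power iteration, a simple method that only recovers the first principal component. Projecting the centered data onto that vector reduces two dimensions to one.

```python
import math

# Small illustrative 2-D dataset (not from the workshop); the columns are strongly correlated.
data = [
    (2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
    (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9),
]


def center(rows):
    """Subtract the column means so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance_matrix(centered):
    """Sample covariance matrix (divides by n - 1) of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpair(matrix, n_iter=200):
    """Power iteration: repeatedly multiply and normalize to approach the
    eigenvector with the largest eigenvalue. Assumes the all-ones start
    vector is not orthogonal to that eigenvector."""
    v = [1.0] * len(matrix)
    for _ in range(n_iter):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))  # Rayleigh quotient
    return eigenvalue, v


centered = center(data)
cov = covariance_matrix(centered)
eigenvalue, component = top_eigenpair(cov)

# Dimensionality reduction: one score per row instead of two coordinates.
scores = [sum(a * b for a, b in zip(row, component)) for row in centered]
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

print(f"first principal component = {component}")
print(f"share of variance explained = {explained:.3f}")
```

The variance of `scores` equals the leading eigenvalue, which is why the ratio of that eigenvalue to the trace of the covariance matrix is read as the share of variance kept after the reduction.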
Depending on time and interest, one clustering algorithm, k-means, will also be implemented.
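For readers who want a preview of k-means, the following is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) in plain Python. The points, the function names and the choice of random initial centroids are assumptions made for this illustration only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Return (centroids, labels) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
    return centroids, labels


# Hypothetical data: two well-separated groups of 2-D points.
points = [
    (1.0, 1.1), (0.9, 0.8), (1.2, 1.0), (0.8, 1.2),
    (5.1, 4.9), (4.8, 5.2), (5.0, 5.1), (5.2, 4.7),
]

centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The result depends on the initial centroids, so on less cleanly separated data it is common to run the algorithm several times with different seeds and keep the best clustering.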
Software Requirements for the Workshop:
We will be using the Python data stack for the workshop.
Please install Anaconda for Python 3.5 for the workshop. It has everything we need.
For the more curious attendees: we will be using Jupyter Notebook as our IDE. We will be introducing
Data Repository for the Workshop:
The data necessary for the workshop will be available in the workshop’s GitHub repository. Please download it before coming to the workshop. The repository for the workshop is:
The repository will be updated/available three days before the workshop (EoD 27th July 2016). Please refer to the repo and install the necessary requirements prior to the workshop. Installation support won’t be provided on the day of the workshop.
We expect participants to know programming and a bit of Python. Specifically, we expect participants to know the first three sections from this: http://anandology.com/python-practice-book/
Participants should bring their own laptops with the required software already installed. There will be no support for installing the required software on the workshop day. Please post queries/issues on the friendsofhasgeek Slack channel.
Amit Kapoor teaches the craft of telling visual stories with data. He conducts workshops and trainings on Data Science in Python and R, as well as on Data Visualisation topics. His background is in strategy consulting having worked with AT Kearney in India, then with Booz & Company in Europe and more recently for startups in Bangalore. He did his B.Tech in Mechanical Engineering from IIT, Delhi and PGDM (MBA) from IIM, Ahmedabad. You can find more about him at http://amitkaps.com/ and tweet him at @amitkaps.
Bargava Subramanian is a Data Scientist at Cisco Systems, India. He has 14 years of experience delivering business analytics solutions to Investment Banks, Entertainment Studios and High-Tech companies. He has given talks and conducted workshops on Data Science, Machine Learning, Deep Learning and Optimization in Python and R. He has a Masters in Statistics from University of Maryland, College Park, USA. He is an ardent NBA fan. You can tweet to him at @bargava.