The Fifth Elephant 2012

Finding the elephant in the data.

What are your users doing on your website or in your store? How do you turn the piles of data your organization generates into actionable information? Where do you get complementary data to make yours more comprehensive? What tech, and what techniques?

The Fifth Elephant is a two-day conference on big data.

Early Geek tickets are available from fifthelephant.doattend.com.

The proposal funnel below will enable you to submit a session and vote on proposed sessions. It is good practice to introduce yourself and share details about your work, as well as the subject of your talk, while proposing a session.

Each community member can vote for or against a talk. A vote from each member of the Editorial Panel is equivalent to two community votes. Both types of votes will be considered for final speaker selection.

It’s useful to keep a few guidelines in mind while submitting proposals:

  1. Describe how to use something that is available under a liberal open source license. Participants can use this without having to pay you anything.

  2. Tell a story of how you did something. If it involves commercial tools, please explain why they made sense.

  3. Buy a slot to pitch whatever commercial tool you are backing.

Speakers will get a free ticket to both days of the event. Proposers whose talks are not on the final schedule will be able to purchase tickets at the Early Geek price of Rs. 1800.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Jaidev Deshpande

@jaidevd

Exploratory Data Analysis with Python

Submitted Apr 25, 2012

Objectives:

  1. Learning how to find general details about a dataset before jumping onto the machine learning / big data bandwagon. (I call these a ‘bandwagon’ because, powerful as they are, many applications do not warrant full-scale use of such tools.)
  2. Learning to decide which tools are best for taking apart a big dataset.
  3. Understanding why getting a general feel of the data is necessary before thinking up models to analyze that data.

Outline

So we have a large data file, and we might not know what to do with it. Most probably we are looking for patterns and trends. With a multitude of data analysis tools and algorithms at our disposal, we are often left wondering what the right question to ask of the data is.

Exploratory data analysis is a field that offers tools and algorithms for the broadest, most general look at a piece of data. Only after performing this sort of global analysis can we go ahead and think about building a model to describe the data. This tutorial offers insights into the prerequisites for building such models and, having gained those, into what one can do with the model.
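As a rough illustration of what such a "global look" might involve, here is a minimal sketch using only Python's standard library. The `summarize` helper and the `page_views` sample are hypothetical and invented for this example; they are not taken from the tutorial material.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    # Quartiles split the sorted data into four equal-sized groups.
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),
        "iqr": q3 - q1,  # interquartile range, a spread measure robust to outliers
    }

# Hypothetical sample: daily page views for a small site.
page_views = [120, 135, 128, 142, 130, 890, 125, 138]
summary = summarize(page_views)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up sample the mean sits well above the median, which hints that a single large value (890) is pulling the average up. Noticing that kind of thing before choosing a model is the point of the exercise.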

The tutorial will seek to answer questions like:

  • What’s the best way to cluster / classify a given dataset?
  • What does the data ‘look’ like?
  • How has the dataset evolved over time?
  • How do I know that I have inferred all I can from the dataset?
  • I see some peculiar trends in the dataset. What might have caused these?
  • Do all these questions motivate a good machine learning problem?
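Two of the questions above (how a dataset evolves over time, and what might explain a peculiar trend) can be approached with very simple tools before any modelling. The sketch below is one possible starting point, not the tutorial's own method: the `rolling_mean` and `flag_outliers` helpers, the threshold of two standard deviations, and the `weekly_signups` data are all assumptions made for illustration.

```python
import statistics

def rolling_mean(values, window):
    """Mean of each consecutive run of `window` values."""
    return [
        statistics.mean(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]

def flag_outliers(values, threshold=2.0):
    """Indices of points more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * stdev]

# Hypothetical sample: sign-ups per week, with one unusual week.
weekly_signups = [10, 12, 11, 13, 12, 40, 14, 15, 13, 16]

trend = rolling_mean(weekly_signups, window=3)  # smoothed view of change over time
suspects = flag_outliers(weekly_signups)        # weeks worth a closer look

print("3-week rolling mean:", trend)
print("suspicious weeks (by index):", suspects)
```

A flagged index says only that a point is unusual relative to the rest; finding out why (a campaign, a logging bug, a holiday) still means going back to the source of the data.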

Requirements

  1. A basic knowledge of Python. (Knowing how to use the numpy.ndarray object will be a plus.)
  2. Basic probability and statistics.
  3. Basic knowledge of popular data formats.
  4. File handling and I/O.
  5. Preferably a laptop with the free version of the Enthought Python Distribution installed. (A one-step solution for everything you’ll need in Python for scientific computing.)
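To give a sense of the level of file handling and data-format knowledge assumed, the following sketch reads a small CSV and counts missing values per column using only the standard library. The column names, the sample rows, and the `load_rows` and `missing_counts` helpers are hypothetical; a real session would more likely read from a file on disk and might use NumPy arrays instead.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be a file opened with open(...).
raw = io.StringIO(
    "date,visits,country\n"
    "2012-04-01,120,IN\n"
    "2012-04-02,,IN\n"
    "2012-04-03,142,US\n"
)

def load_rows(handle):
    """Read every row of a CSV file object into a list of dictionaries."""
    return list(csv.DictReader(handle))

def missing_counts(rows):
    """Count empty or absent values for each column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is None or value.strip() == "":
                counts[column] += 1
    return counts

rows = load_rows(raw)
missing = missing_counts(rows)
print(len(rows), "rows read")
print("missing values per column:", missing)
```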

Speaker bio

I am an electrical engineering undergrad at VIIT, Pune. I’ve been working as a research assistant in the fields of machine learning and signal processing. I am currently an intern at Enthought, Inc., where I work on data analysis and visualization. I also contribute code and documentation to the SciPy project.

