The ART of Data Mining - Practical Learnings from Real-World Data Mining Applications
Submitted by Shailesh Kumar (@shkumar) on Tuesday, 3 June 2014
Machine learning and data mining are part SCIENCE (ML algorithms, optimization), part ENGINEERING (large-scale modelling, real-time decisions), part PROCESS (data understanding, feature engineering, modelling, evaluation, and deployment), and part ART. In this talk we will focus on the "ART of data mining": the little things that make a big difference in the quality and sophistication of the machine learning models we build. Using real-world analytics problems from a variety of domains, we will share a number of practical learnings in:
(1) The art of understanding the data better (e.g. visualization of text data in a semantic space)
(2) The art of feature engineering (e.g. converting raw inputs into meaningful and discriminative features)
(3) The art of dealing with nuances in class labels (e.g. creating, sampling, and cleaning up class labels)
(4) The art of combining labeled and unlabeled data (e.g. semi-supervised and active learning)
(5) The art of decomposing a complex modelling problem into simpler ones (e.g. divide and conquer)
(6) The art of using textual features together with structured features to build models, among others.
The key objective of the talk is to share some of the learnings that might come in handy while "designing" and "debugging" machine learning solutions and to give a fresh perspective on why data mining is still mostly an ART.
The role of a data scientist has evolved in the last few years from someone who can put together a "modelling pipeline" to someone who can: (a) "understand" the data beyond basic statistics and simple visualizations, (b) extract "deep" and "novel" insights from the data, (c) engineer "better features" so that complexity is distributed fairly between features and models, (d) visualize and make sense of complex data types such as networks and unstructured text corpora, and (e) create innovative ways of harnessing data to make smarter decisions.
To create "magic from data", a data scientist must go beyond the SCIENCE, ENGINEERING, and PROCESS and delve into the ART of data mining. In this talk I will share a number of "mistakes" and "innovations" that helped me build better models in domains as diverse as remote sensing, text classification, text clustering, fraud detection, information retrieval, bioinformatics, retail data mining, and image understanding.
These practical insights might help the audience pay attention to the right details in the modelling process, look for model improvements in the right places, be more creative with their data and use its full potential, and even overcome the limitations of their modelling tools.
This talk is more about modelling methodology insights than about tools and algorithms. Some prior experience with building machine learning models (in any domain, using any technique) is helpful but not required.