Design for Data
When evaluating the quality and likelihood of success of AI/ML projects, I have found it helpful to think in terms of three core components: Workflow, Data, and Algorithms. In media and public discussion, algorithms tend to receive the most attention, and to young data scientists they often seem the most exciting. This talk focuses on the two other, underrated components: workflow and data. In the majority of cases I’ve seen, as both a data scientist and an investor, they determine whether a project makes a real difference and succeeds in practice. High-quality data comes from the work of design, and that work is fascinating, challenging, and rewarding; it deserves every data scientist’s and engineer’s attention and practice. I will present a few key steps of designing for data, along with many practical, real-world examples and illustrations from my work and study as a data scientist.
- Introduction: the framework of Workflow, Data, and Algorithms for AI/ML projects.
- What is data? A representation of a part of the world that we care about.
- The Data Generating Process
- The data collection process (the technology and operations by which data reaches a database)
- The statistical model
- The probabilistic model
- Data Quality as a function of data use - availability and visibility
- Knowing the past readily - before predicting the future
- The Complexity of Taking Action on the World - Learning from Machine Learning
- Tracking and storing models, predictions, and results
- Conclusion and Takeaways
Past experience with real-world data science projects will be helpful. The talk will aim to provide something for beginners as well as advanced professionals.
Paul Meinshausen is a Data Scientist in Residence at Montane Ventures, an early-stage venture capital fund. Previously he was Co-Founder and Chief Data Scientist at PaySense, a mobile fintech startup in Mumbai. Earlier roles include Vice President of Data Science at Housing.com and Principal Data Scientist at Teradata. He has a research background in behavioral and cognitive science, first worked on big and unstructured data for the U.S. Department of Defense in Afghanistan, and was a Data Science for Social Good Fellow at the University of Chicago’s Computation Institute.