The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

##Theme and format
The Fifth Elephant 2017 is a four-track conference on:

  1. Data engineering – building pipelines and platforms; exposure to the latest open source tools for data mining and real-time analytics.
  2. Application of Machine Learning (ML) in diverse domains such as IoT, payments, e-commerce, education, ecology, government, agriculture, computational biology, social network analysis and emerging markets.
  3. Hands-on tutorials on data mining tools, and ML platforms and techniques.
  4. Off-the-record (OTR) sessions on privacy issues concerning data; building data pipelines; failure stories in ML; interesting problems to solve with data science; and other relevant topics.

The Fifth Elephant is a conference for practitioners, by practitioners.

Talk submissions are now closed.

You must submit the following details along with your proposal, or within 10 days of submission:

  1. Draft slides, mind map or a textual description detailing the structure and content of your talk.
  2. Link to a self-recorded, two-minute preview video, where you explain what your talk is about, and the key takeaways for participants. This preview video helps conference editors understand the lucidity of your thoughts and how invested you are in presenting insights beyond your use case. Please note that the preview video should be submitted irrespective of whether you have spoken at past editions of The Fifth Elephant.
  3. If you submit a workshop proposal, you must specify the target audience for your workshop; duration; number of participants you can accommodate; pre-requisites for the workshop; link to GitHub repositories and documents showing the full workshop plan.

##About the conference
This year is the sixth edition of The Fifth Elephant. The conference is a renowned gathering of data scientists, programmers, analysts, researchers, and technologists working in the areas of data mining, analytics, machine learning and deep learning from different domains.

We invite proposals for the following sessions, with a clear focus on the big picture and insights that participants can apply in their work:

  • Full-length, 40-minute talks.
  • Crisp, 15-minute talks.
  • Sponsored sessions, of 15 minutes and 40 minutes duration (limited slots available; subject to editorial scrutiny and approval).
  • Hands-on tutorials and workshop sessions of 3-hour and 6-hour duration where participants follow instructors on their laptops.
  • Off-the-record (OTR) sessions of 60-90 minutes duration.

##Selection Process

  1. Proposals will be filtered and shortlisted by an Editorial Panel.
  2. Proposers, editors and community members must respond to comments as openly as possible so that the selection process is transparent.
  3. Proposers are also encouraged to vote and comment on other proposals submitted here.

Selection Process Flowchart

We will notify you if we move your proposal to the next round or reject it. A speaker is NOT confirmed for a slot unless we explicitly say so in an email or over any other medium of communication.

Selected speakers must participate in one or two rounds of rehearsals before the conference. This is mandatory and helps you to prepare well for the conference.

There is only one speaker per session. Entry is free for selected speakers.

##Travel grants
Partial or full grants, covering travel and accommodation, are made available to speakers delivering full sessions (40 minutes) and workshops. Grants are limited, and are given in order of preference to students, women, persons of non-binary genders, and speakers from Asia and Africa.

##Commitment to Open Source
We believe in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like for it to be available under a permissive open source licence. If your software is commercially licensed or available under a combination of commercial and restrictive open source licences (such as the various forms of the GPL), you should consider picking up a sponsorship. We recognise that there are valid reasons for commercial licensing, but ask that you support the conference in return for giving you an audience. Your session will be marked on the schedule as a “sponsored session”.

##Important Dates:

  • Deadline for submitting proposals: June 10
  • First draft of the conference schedule: June 20
  • Tutorial and workshop announcements: June 20
  • Final conference schedule: July 5
  • Conference dates: 27-28 July

##Contact
For more information about speaking proposals, tickets and sponsorships, contact info@hasgeek.com or call +91-7676332020.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Vijay Srinivas Agneeswaran, Ph.D

@vijayagneeswaran

Distributed Consensus and Data Safety: NewSQL Perspective

Submitted Apr 18, 2017

We explore data safety issues in designing large distributed systems. Though data safety has been addressed in traditional complex software systems such as aircraft engineering systems, ensuring it in distributed systems is a complex and arduous task. The complexity arises from the need to ensure the safety of several kinds of data: configuration data, state changes at individual nodes, global state changes and so on. In addition, the global state must be kept consistent, and all of the above data must be verified and validated. We explore formal verification of the safety properties of distributed systems through recent work on IronFleet (http://sigops.org/sosp/sosp15/current/2015-Monterey/250-hawblitzel-online.pdf).

We start from the distributed consensus problem and explain how it can be framed using the parable of La Tryste, leading to Fischer, Lynch and Paterson’s impossibility result. We then illustrate the conditions and assumptions under which consensus is possible, and discuss how failure detectors can be used to solve it. We go on to discuss the Paxos algorithm [4] and its various formulations, variations and simplifications. We talk about the CAP theorem [3] and illustrate the choices different NoSQL systems make in this respect. We then present commit protocols as variations of distributed consensus and illustrate their importance for data safety.
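To make the two phases of Paxos concrete, here is a minimal in-memory sketch of single-decree Paxos. It is an illustration under simplifying assumptions, not material from the talk: there is no network, no message loss and no crash recovery, and the names `Acceptor` and `propose` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    """Single-decree Paxos acceptor; state is kept in memory for illustration."""
    promised: int = -1                           # highest ballot promised so far
    accepted: Optional[Tuple[int, str]] = None   # (ballot, value) last accepted

    def prepare(self, ballot: int):
        # Phase 1b: promise to ignore lower ballots and report any accepted value.
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot: int, value: str) -> bool:
        # Phase 2b: accept unless a higher ballot has been promised in the meantime.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def propose(acceptors: List[Acceptor], ballot: int, value: str) -> Optional[str]:
    """Run one proposer round; return the chosen value, or None without a quorum."""
    quorum = len(acceptors) // 2 + 1

    # Phase 1: collect promises from a majority of acceptors.
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) < quorum:
        return None

    # If any acceptor already accepted a value, adopt the one with the highest ballot.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]

    # Phase 2: the value is chosen once a majority of acceptors accept it.
    votes = sum(a.accept(ballot, value) for a in acceptors)
    return value if votes >= quorum else None
```

Once a value has been chosen, a later proposer with a higher ballot learns it in phase 1 and re-proposes it instead of its own value. This is the safety property that consensus-based commit protocols rely on.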

We explore the different kinds of NewSQL datastores that have emerged in the last few years and tackle data safety by providing ACID consistency of distributed state across a large collection of nodes. We briefly outline systems such as Google Spanner [1], Clustrix, VoltDB and NimbusDB.

We outline how Google Spanner provides ACID consistency across a wide-area distributed system. It provides a strict form of consistency known as linearizability [2], and is the first system to do so across a WAN. Spanner assigns global timestamps to transactions across a distributed set of nodes; the timestamps reflect serialization order. The key to Spanner’s global timestamps is its TrueTime API and its implementation. The TrueTime API abstracts and exposes clock uncertainty and allows applications to reason with that uncertainty, while the TrueTime implementation in Google’s datacenters restricts the uncertainty to less than 10 milliseconds. This is small compared to, say, NTP, where the deltas between different clocks across a distributed system can be as high as 250 milliseconds. Google’s TrueTime implementation achieves this by relying on two kinds of physical clock: atomic and GPS. F1, an advertisement backend built by Google, was the first “client” of Spanner to use it in production. In essence, Spanner provides the following features: semi-relational tables, a query language based on SQL, elasticity and ACID transactions.
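The way a bounded clock uncertainty turns into globally ordered timestamps can be sketched with the commit-wait rule from the Spanner paper [1]: pick the latest possible current time as the commit timestamp, then wait until that timestamp has definitely passed. The sketch below is a hypothetical stand-in, not Google's API; the class `TrueTimeSketch`, the function `commit` and the fixed `epsilon` are assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTimeSketch:
    """Hypothetical stand-in for TrueTime: a clock reading plus an error bound."""

    def __init__(self, epsilon: float, clock: Callable[[], float] = time.monotonic):
        self.epsilon = epsilon   # assumed worst-case clock uncertainty, in seconds
        self.clock = clock

    def now(self) -> TTInterval:
        # The true time is assumed to lie somewhere inside this interval.
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        # True only once t has definitely passed, whatever the clock error is.
        return self.now().earliest > t


def commit(tt: TrueTimeSketch, sleep: Callable[[float], None] = time.sleep) -> float:
    """Pick a commit timestamp, then wait out the uncertainty before exposing it."""
    s = tt.now().latest
    while not tt.after(s):
        sleep(tt.epsilon / 10)
    return s
```

With this rule, each commit waits roughly twice the uncertainty bound, which is why keeping that bound to a few milliseconds matters: at NTP-scale deltas the wait would dominate transaction latency.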

Another aspect to keep in mind while designing large distributed systems is that most existing algorithms, including Paxos and those used in Google Spanner, do not solve the Byzantine consensus problem [7]. Byzantine consensus is a formulation of the consensus problem in which nodes may behave arbitrarily, allowing reasoning about difficult real-world conditions such as software bugs. One may have to explore blockchain-style technologies [8] to solve Byzantine consensus.
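The gap between the two fault models shows up in how many nodes are needed. Majority-quorum protocols such as Paxos tolerate f crash faults with 2f + 1 nodes, while the classic result in [7] requires at least 3f + 1 nodes to tolerate f Byzantine faults when messages are unsigned. A small sketch of that arithmetic, with hypothetical function names:

```python
def max_crash_faults(n: int) -> int:
    """Crash faults tolerated by a majority-quorum protocol: n >= 2f + 1."""
    return (n - 1) // 2


def max_byzantine_faults(n: int) -> int:
    """Byzantine faults tolerated with unsigned messages: n >= 3f + 1 [7]."""
    return (n - 1) // 3


for n in (3, 4, 5, 7):
    print(n, max_crash_faults(n), max_byzantine_faults(n))
```

A three-node cluster, for instance, survives one crashed node but cannot tolerate even a single Byzantine node.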

For more details, please see my post on the ACM SIGMOD blog:
http://wp.sigmod.org/?p=2153.

[1] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s Globally-Distributed Database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, USA, 251-264.

[2] Maurice P. Herlihy and Jeannette M. Wing. 1990. Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems 12, 3 (July 1990), 463-492.

[3] Seth Gilbert and Nancy Lynch. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (June 2002), 51-59. DOI: https://doi.org/10.1145/564585.564601

[4] Butler Lampson. 2001. The ABCD’s of Paxos. In Proceedings of the twentieth annual ACM symposium on Principles of distributed computing (PODC '01). ACM, New York, NY, USA, 13-. DOI=http://dx.doi.org/10.1145/383962.383969.

[5] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. 2011. Zab: High-performance broadcast for primary-backup systems. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN '11). IEEE Computer Society, Washington, DC, USA, 245-256.

[6] Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC’14), Garth Gibson and Nickolai Zeldovich (Eds.). USENIX Association, Berkeley, CA, USA, 305-320.

[7] Leslie Lamport, Robert Shostak, and Marshall Pease. 1982. The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4, 3 (July 1982), 382-401. DOI=http://dx.doi.org/10.1145/357172.357176.

[8] Crosby, M., Pattanayak, P., Verma, S., & Kalyanaraman, V. (2016). Blockchain technology: Beyond bitcoin. Applied Innovation, 2, 6-10.

Outline

  1. Data safety issues in distributed systems
  2. Overview of Paxos and distributed consensus algorithms
  3. NewSQL datastores - brief on Google Spanner, Clustrix, NimbusDB etc.
  4. IronFleet and formal verification of safety properties of distributed systems.

Requirements

Fundamentals of distributed systems.

Speaker bio

Dr. Vijay Srinivas Agneeswaran has a Bachelor’s degree in Computer Science & Engineering from SVCE, Madras University (1998), an MS (By Research) from IIT Madras (2001), a PhD from IIT Madras (2008) and a post-doctoral research fellowship in the LSIR Labs, Swiss Federal Institute of Technology, Lausanne (EPFL). He has joined SapientNitro as Director of Technology in the data sciences team. He has spent the last ten years creating intellectual property and building products in the big data area at Oracle, Cognizant and Impetus. He has built PMML support into Spark/Storm and realized several machine learning algorithms, such as LDA and Random Forests, over Spark. He led a team that designed and implemented a big data governance product for role-based, fine-grained access control inside Hadoop YARN. He and his team have also built the first distributed deep learning framework on Spark. He has been a professional member of the ACM and the IEEE (Senior) for over 10 years. He has four full US patents and has published in leading journals and conferences, including IEEE transactions. His research interests include distributed systems, data sciences, Big-Data and other emerging technologies. He has been an invited speaker at several national and international conferences such as O’Reilly’s Strata Big-data conference series. He will also be speaking at the Strata Big-data conference in London in May 2017. He also gave a keynote speech at the Fifth Elephant conference in 2014. He lives in Bangalore with his wife, son and daughter and enjoys researching the history and philosophy of Egypt, Babylonia, Greece and India.

Slides

https://drive.google.com/file/d/0B2TbzamOB-KMYkNGODQ1X3k2Xzg/view?usp=sharing

