Analytics using Hadoop ecosystem on AWS
Submitted by Rajat Venkatesh (@vrajat) on Monday, 20 May 2013
The workshop will go through the steps required to use the AWS ecosystem as an analytics backend. We will discuss general design patterns, but in many cases we will show examples using the Qubole platform.
Organizations that want to perform analytics in the AWS Cloud need to figure out the following:
- How do we get our log data sets into the cloud (AWS S3)?
- How do we import data to Amazon S3 from on-premises or in-cloud databases such as MySQL, MongoDB or PostgreSQL?
- Do I need a persistent Hadoop cluster?
- How do I set up the system so that multiple users within the organization can run M/R, Pig or Hive commands?
- What are the best practices for organizing data on S3 for long-term storage and query?
- What about security? What are the security risks of doing analytics in the cloud?
- What about cost?
- What is the role of Hadoop versus traditional data warehouses like Vertica and AWS Redshift?
- What about data visualization? How do I build reports using this infrastructure, and where do I host them?
We will lay out some common design patterns and alternatives for these questions. For some of the questions we may highlight features in the Qubole platform, and where we go through live examples we may be using Qubole Data Service. After this workshop, attendees will be better informed about the process of getting data analytics up and running on AWS.
A laptop and an AWS account. If you have data, store it in AWS S3.
Rajat Venkatesh is an engineer at Qubole and has experience in all aspects of helping users analyze their data on AWS. Before Qubole, he worked as a database kernel developer at Vertica, a big data analytics platform.