Need for “Lmetric”: the service for near real-time clickstream events and user behavior analysis
Submitted by Piyush (@piykumar) on Thursday, 30 May 2013
Storage and Databases
The session will cover the following:
• Different types of data sources: collection, trending and analysis.
• Capturing events: why structured data emitted from apps for machines is a better approach.
• Centralized collection of events in distributed environments.
• Easy access to collected data in a consumable form for reporting and analysis.
At the start of the session I will touch upon the MakeMyTrip infrastructure setup and the data handling challenges of a multi-DC/colocation setup.
Learnings from the past setup and the need for a new generation of tools!
• Different types of data sources: collection, trending and analysis.
  o Understanding the landscape of data sources, both internal and external (structured, semi-structured, unstructured).
  o Tools and techniques used for collecting, trending and analytics.
• Capturing events: why structured data emitted from apps for machines is a better approach.
  o Need for standardization: JSON as a standard for capturing events/messages from client-side and server-side applications.
• Centralized collection of events in distributed environments.
  o Centralized event management plays a key role in both operational excellence and complete visibility across different layers.
Key things for implementing a solution for the above:
• Collection (Event Collector) & Filtering
• Indexing & Searching
• Easy access to collected data in a consumable form for reporting and analysis.
  o Reporting & Visualizations
• Best practices for designing systems where events are collected, managed and made available with complete reliability, such as:
  o Use timestamps for every event.
  o Use unique identifiers (IDs) such as Transaction ID / User ID / Session ID, or append a UUID (universally unique identifier) to track unique users.
  o Keep the date, time and timezone NTP-synced on every producer and collector machine (# ntpdate ntp.example.com).
  o The 80/20 rule: 80% or more of our goals can be achieved with 20% of the work, so don’t log too much.
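The practices above (JSON as the event format, a timestamp on every event, unique identifiers) can be illustrated with a minimal sketch. This is not Lmetric's actual schema: the field names (`event_id`, `timestamp`, `event`, `session_id`, `user_id`, `data`) and the helper functions are hypothetical, chosen only to show what a machine-readable event emitted by an application might look like.

```python
import json
import uuid
from datetime import datetime, timezone


def make_event(name, session_id, user_id, **fields):
    """Build one structured event as a plain dict (hypothetical schema)."""
    return {
        # Unique ID per event, so duplicates can be detected downstream.
        "event_id": str(uuid.uuid4()),
        # UTC timestamp in ISO 8601 form, independent of the producer's timezone.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": name,
        # Identifiers that let a session or user be traced across events.
        "session_id": session_id,
        "user_id": user_id,
        # Event-specific payload; keep it small (the 80/20 rule).
        "data": fields,
    }


def to_json_line(event):
    """Serialize an event as one JSON document on a single line."""
    return json.dumps(event, sort_keys=True)


def parse_json_line(line):
    """Parse a JSON line back into a dict, as a collector might."""
    return json.loads(line)


if __name__ == "__main__":
    event = make_event("search", session_id="s-1", user_id="u-42",
                       origin="DEL", destination="BOM")
    print(to_json_line(event))
```

Writing one JSON document per line is only one possible convention; it keeps events easy to ship, filter and index with line-oriented tools, but any collector that accepts JSON would work with the same event structure.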
Piyush currently manages Website Operations and is part of the BI team at MakeMyTrip. His daily work there includes DevOps, Security Operations and Infrastructure initiatives, and he is responsible for business and operational goals for performance, security, and availability.