Storing relationships in large data-sets using Graphs
Problem Statement - Fast programmatic/self-serve analytics on linked data in an ad system, achieved by indexing the data across all cuts, especially for traversals like -
- Find all users who came from ‘iphone’ and ‘SFO’ with 10k or more clicks within the last two days.
- Find all users who played ‘Subway Surfers’ from the U.S. more than 10 times in the last week.
As is evident from the above examples, this class of queries is different from a typical pointed query like - “find my friends who have been to the Golden Gate Bridge in the last year and have liked hiking articles”. A pointed query starts with a point lookup followed by a BFS traversal with appropriate filtering criteria, and is addressed in a generic fashion by databases like Neo4j and Titan.
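To make the contrast concrete, here is a minimal sketch in Python. It is an illustration only, not the system described in the talk: the toy `users` and `friends` data, the inverted index, and both function names are assumptions made up for this example, and the time-window part of the queries is left out for brevity. A non-pointed query has no starting vertex, so it is answered by intersecting index entries across cuts; a pointed query starts at one vertex and walks outward with BFS.

```python
from collections import defaultdict, deque

# Hypothetical toy data: user id -> attributes and a click count.
users = {
    "u1": {"device": "iphone", "city": "SFO", "clicks": 12000},
    "u2": {"device": "iphone", "city": "SFO", "clicks": 300},
    "u3": {"device": "android", "city": "SFO", "clicks": 15000},
    "u4": {"device": "iphone", "city": "NYC", "clicks": 20000},
}

# Inverted index over every cut: (attribute, value) -> set of user ids.
index = defaultdict(set)
for uid, attrs in users.items():
    for key in ("device", "city"):
        index[(key, attrs[key])].add(uid)


def non_pointed_query(filters, min_clicks):
    """No starting vertex: intersect the index entries, then apply the threshold."""
    candidate_sets = [index[(k, v)] for k, v in filters.items()]
    candidates = set.intersection(*candidate_sets) if candidate_sets else set(users)
    return sorted(u for u in candidates if users[u]["clicks"] >= min_clicks)


# Hypothetical friendship edges for the pointed case.
friends = {"u1": ["u2", "u3"], "u2": ["u4"], "u3": [], "u4": []}


def pointed_query(start, max_depth, predicate):
    """Point lookup on `start`, then BFS up to `max_depth` hops with a filter."""
    seen = {start}
    queue = deque([(start, 0)])
    matches = []
    while queue:
        node, depth = queue.popleft()
        if node != start and predicate(node):
            matches.append(node)
        if depth < max_depth:
            for nxt in friends.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return matches
```

For example, `non_pointed_query({"device": "iphone", "city": "SFO"}, 10000)` scans no edges at all, whereas `pointed_query("u1", 1, ...)` touches only the neighbourhood of `u1`. The first shape is the one that benefits from indexing across all cuts.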
Scope of the talk -
- highlight the internals of what it takes to solve for non-pointed queries in a generic fashion.
- extend it to support the TinkerPop API specification, as used by Neo4j and Titan, so that users can easily switch from one backend to another.
This work was motivated by the need to store large amounts of linked data in an ad system and make it available for programmatic/analytics consumption.
This talk outlines our journey, which started with researching existing graph databases and processing frameworks, understanding why they didn’t work for us at our scale, and then moving on to build something ourselves.
We will go in depth to explain the data structures used and how we supported the TinkerPop graph API specification (used by many graph databases). We will also touch upon how our ad system’s unique data model allowed us to come up with a fairly simple technique to shard the entire thing and query over it.
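The abstract does not say which sharding technique was used. Purely as an illustration, and assuming a data model in which each user's linked data is self-contained, the sketch below hash-partitions records by user ID and answers a non-pointed query by asking every shard and merging the results. `NUM_SHARDS`, `shard_for`, `put`, and `scatter_gather` are hypothetical names invented for this example.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for the example

# Each shard is modelled as an in-memory dict: user id -> record.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(user_id: str) -> int:
    """Stable hash of the user id, so a given user always maps to the same shard."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_SHARDS


def put(user_id: str, record: dict) -> None:
    """Store all of a user's data on the single shard that owns that user."""
    shards[shard_for(user_id)][user_id] = record


def scatter_gather(predicate):
    """Run the same filter on every shard and merge the per-shard matches."""
    results = []
    for shard in shards:
        results.extend(uid for uid, rec in shard.items() if predicate(rec))
    return sorted(results)
```

Because no record spans two shards under this assumption, each shard can evaluate the filter independently and the merge step is a plain concatenation.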
Takeaways from this talk -
- what graph databases are and when you should choose one.
- different use-cases require different stores.
- what it takes to build a graph store for allocentric (OLAP-like) graph traversals.
- prerequisite: an inclination towards linked data; everything else will be covered.
Inder Singh - has been working on data-related problems at InMobi (the world’s largest independent ad network) for the past ~3 years.
- LinkedIn - http://www.linkedin.com/in/indspall
- Speaker at Hadoop meetups at Informatica and Yahoo - http://www.slideshare.net/InderSingh10/meetup-realtime-datacollection