The Fifth Elephant 2013

An Event on Big Data and Cloud Computing

The art and science of exploiting near-similar text and images

Submitted by Srinivasan H Sengamedu (@shs) on Wednesday, 5 June 2013

Technical level

Intermediate

Section

Analytics and Visualization

Status

Submitted

Objective

Big Data, by its very nature, contains near-similar items. Identifying these repetitions and, better still, leveraging them to get your job done is both an art and a science. The goal of this talk is to share some experience with doing so and to get you excited about the possibilities.

Description

I will first motivate how data repetitions provide an opportunity in several tasks: image recognition, spam detection, string matching, etc. I will then talk about specific techniques for identifying such near-duplicates at scale: signature-based near-duplicate image detection, sequence mining, and a new string similarity measure. By then, hopefully, you will be excited enough to take a fresh look at your own data.
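
As a rough illustration of what a signature-based approach to near-duplicates can look like, here is a minimal sketch using MinHash over character shingles. It works on text rather than images, and it is a commonly used stand-in, not necessarily the method covered in the talk. The function names, the shingle length of 4, and the choice of 64 hash functions are all illustrative assumptions.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of k-character shingles of a lowercased, whitespace-normalised string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def seeded_hash(seed, token):
    """Hash a token to a 64-bit integer; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=64, k=4):
    """Compact signature: for each hash function, keep the minimum hash over all shingles."""
    tokens = shingles(text, k)
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical example strings: two near-identical spam lines and one unrelated line.
spam_a = minhash_signature("Buy cheap watches online today!!!")
spam_b = minhash_signature("Buy cheap  watches online today!!")
other = minhash_signature("The quarterly report is attached.")

near = estimated_similarity(spam_a, spam_b)
far = estimated_similarity(spam_a, other)
print(f"near-duplicate pair: {near:.2f}, unrelated pair: {far:.2f}")
```

Because each item is reduced to a short fixed-length signature, items can be compared (or bucketed by signature fragments) without comparing the full content of every pair, which is what makes this family of techniques attractive at scale.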

Requirements

Ability to look at the big picture without being bogged down by details. And the ability to look at details after that.

Speaker bio

I've worked on machine learning applications in web content analysis, web search, and web advertising - now at Komli and previously at Yahoo and other places. I've also published and patented some of this work.

Slides

http://www.slideshare.net/informationexcellence/information-excellence-2012febkomlisrinivasan-s-hmaking-data-repitions-work
