Compromising a $6B big data project through poor data quality: the Aadhaar case study

This submission has been added to the schedule

Submitted Jul 2, 2018

Section: Full talk
Technical level: Beginner

The Aadhaar project holds at least 3 PB of data. Its promise of providing a unique, multimodal, biometric-backed identity to everyone in India hinges on the quality of the biometric templates obtained during enrollment and on the veracity and trustworthiness of the identity documents. The scale the project needs can only be achieved through enrollment centers that are spread out and available in every village.

The UIDAI enlisted Common Service Centers (CSC) and 60,000 private agents for both enrollment and updates. Fraud is inevitable in a system scaled up this far, so much care was taken to standardize the enrollment process through software deployed at the enrollment centers. A second set of defences was built into the back end, using data analytics to detect fraud.

Every confirmed fraud led to more security features being added to the enrollment software. However, the UP Aadhaar hack case provided the first glimpse of how fraudsters managed to disable those security features and also defeat the back-end data analytics. The offline nature of data acquisition provided a window large enough to compromise the data quality of the biometric templates and the identity documents without detection for at least a year.

Further, the streamlining of the data acquisition process has made it very hard to stop the deployment and continued use of the compromised software even today, which illustrates the problem of acquiring good-quality data at scale.

Outline

The Aadhaar enrollment software and how it works.
The data quality checks in the software for maximizing enrollment success.
Additional metadata created by the software for successful fraud detection in the back end.
Case Study 1 - Data pollution using exceptions - The IL&FS fraud case, and how the humble postman detected it but big data analytics did not.
Case Study 2 - The accelerating data quality errors - How UIDAI failed to read the tea leaves.
Case Study 3 - The UP Aadhaar hack case - Why the first version of the software only had biometric overrides.
Case Study 4 - The Punjab hack case - Why it only had fraud detection overrides (such as GPS).
Case Study 5 - The Bengal hack case - Why it had overrides for biometric data quality checks.
Case Study 6 - The missing identity documents.
Cost-benefit analysis from a fraudster’s point of view, fighting against a big data analytics engine.

The end goal of this talk is to make attendees recognize that:

Scaling data acquisition systems to country-wide deployment creates novel challenges that can fully compromise data quality.
Offline (eventually consistent) data acquisition systems need full tamper-proofing for analytics to be effective.
Sentient opponents facing a machine-driven intelligence will focus on corrupting its data inputs, and will succeed.
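
To make the tamper-proofing point concrete, here is a minimal sketch of one possible ingredient: records captured offline are chained with an HMAC so that the back end can tell when a record was altered or dropped before upload. This is an illustration only and not a description of the UIDAI enrollment software. The names `append_record` and `first_bad_index`, the record fields, and the key handling are all hypothetical. It also assumes the signing key is held somewhere the operator cannot read it (for example, secure hardware); if the client software itself is compromised and can reach the key, as in the cases discussed in this talk, this scheme alone does not help.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # fixed anchor used as the "previous tag" of the first record


def _tag(key: bytes, prev_tag: str, payload: dict) -> str:
    """HMAC-SHA256 over the previous tag plus a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, prev_tag.encode("ascii") + body, hashlib.sha256).hexdigest()


def append_record(chain: list, key: bytes, payload: dict) -> None:
    """Client side (hypothetical): add a record whose tag depends on all earlier ones."""
    prev_tag = chain[-1]["tag"] if chain else GENESIS
    chain.append({"payload": payload, "tag": _tag(key, prev_tag, payload)})


def first_bad_index(chain: list, key: bytes):
    """Server side (hypothetical): index of the first record failing verification, else None."""
    prev_tag = GENESIS
    for i, record in enumerate(chain):
        expected = _tag(key, prev_tag, record["payload"])
        if not hmac.compare_digest(expected, record["tag"]):
            return i
        prev_tag = record["tag"]
    return None
```

Editing the GPS field of an uploaded record, or deleting an earlier record, makes `first_bad_index` report the position where the chain breaks. Truncating the tail of the chain is not detected by this sketch; that would need an additional signed record count or a server-side sequence check.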

Requirements

Just your Aadhaar numbers!

Speaker bio

I helped the petitioners who challenged the Aadhaar project in the Supreme Court of India to understand the technology behind Aadhaar, and can say with some modesty that I also helped the senior counsels Mr. Shyam Divan, Mr. Gopal Subramaniam, Mr. Anand Grover and Mr. Vishwanath to sharpen their propositions in court.

Slides

https://docs.google.com/presentation/d/1SY-6qotYFnslaKopuA3V-mF7pc4jenvMl8YT21JRL_k/edit?usp=sharing

The Fifth Elephant 2018
