Open Source AI Hackathon

The Fifth Elephant Winter Edition Hackathon

Accepting submissions till 15 Feb 2024, 11:00 PM

Microsoft Reactor Bengaluru, Bengaluru

About the hackathon

The aim of this hackathon is to encourage individuals/teams to apply and develop innovative AI ideas/use cases and publish them as open source projects.

Who can participate

  1. Working professionals
  2. Students
  3. Independent consultants
  4. AI researchers
  5. ML engineers
  6. Lawyers, doctors, agronomists, artists, and others who are keen to collaborate with technologists, and showcase ideas and working demos.

Criteria for submitting projects

  1. Ideas should be open source.
  2. Code should be open source with a permissive open-source LICENSE file added.
  3. Orchestrate your code in such a way that it works with open-source models (pre-trained and fine-tuned), open-source products, platforms, systems, and tools.

How to participate

  1. Submit your project idea and outline here.
  2. Join The Fifth Elephant WhatsApp group to discuss your submission with the mentors.
    Or, if you want to validate your idea/project before submitting it, you can discuss it with the mentors, either in the WhatsApp group or on DM.
  3. Participants should work on their projects and start building soon after submitting ideas. Participants have the entire month of February to work on their projects. The last date for submitting projects is 28 February.
  4. Mentors will be assigned to projects which are shortlisted. Inactive projects, or projects that are not in the consideration list will not be assigned mentors.
  5. Mentors will comment on the submissions throughout February. The reward of the hackathon is the feedback, not just the cash prize.
  6. Demo day for all shortlisted hackathon projects — in person and remote — will be on 10 March. The jury will review the submissions and announce prize winners.

Mentors

  • Bharat Shetty is an AI/ML Consultant. He has worked for Airtel Labs and other organizations on AI/ML/NLP platforms and products, across diverse verticals such as conversational AI, EdTech, IoT, and healthcare. Bharat is the editor of The Fifth Elephant Winter edition and its papers discussion community.

  • Abhishek Mishra is a PSF Fellow and software engineering enthusiast, driving tech events like PyCon India, APAC, Chaos Carnival Conference, and GDG, dedicated to fostering community-centric initiatives.

  • Aniket Maurya is spearheading the creation of intelligent software using AI, serving as a Developer Advocate at Lightning AI ⚡️, and is the creator of GradsFlow.

  • Simrat Hanspal has a career spanning over a decade in the AI/ML space, specializing in Natural Language Processing. Simrat currently spearheads AI product strategy at Hasura and has previously led AI teams at renowned organizations such as VMware, FI Money, and Nirvana Insurance.

  • Sumod Mohan is the co-founder and CEO of stealth startup AutoInfer Private Limited. He is also Technical Advisor to, and was previously CTO of, Niqo Robotics, where he helped build robots to remove weeds from agricultural farms. This work won the Ministry of Electronics and Information Technology (MeitY) and NITI Aayog’s RAISE 2020 Challenge in the agriculture sector. He was an advisor to WebCardio, an AI-based Holter (wearable ECG) manufacturer, and led the Computer Vision Division at Soliton Technologies. He was also CTO of Digital Aristotle, which was acquired by Byjus. He has over 15 years of research experience in computer vision and over 10 years of experience productizing these technologies in the US and India. Prior to this, he worked for HighlightCam Inc, a startup in California, where he led computer vision algorithm development. He holds an M.S. degree from Clemson University, USA, with a specialization in Intelligent Systems and Robotics.

Editors

  • Bharat Shetty is an AI/ML Consultant. He has worked for Airtel Labs and other organizations on AI/ML/NLP platforms and products, across diverse verticals such as conversational AI, EdTech, IoT, and healthcare. Bharat is the editor of The Fifth Elephant Winter edition and its papers discussion community.
  • Akshobhya Jamadagni is Editorial Assistant for The Fifth Elephant Open Source AI Hackathon. He is passionate about contributing value across various levels of abstraction, from high-level technical strategy to detailed implementation.

Team composition

  1. You can submit your project as an individual.
  2. Team size is restricted to a maximum of 3 members.
  3. Add your teammates as collaborators after submitting your idea.

Ideas for the hackathon

Participants can propose projects around some of the following ideas:

  1. AI for Scientific Research: e.g. Protein folding models, climate models, drug discovery, image recognition for scientific research, simulations for material science, epidemiology, and more.
  2. AI for inclusivity and accessibility: e.g. STT/TTS, automated audio descriptions (for non-voice content), automated color blindness correction, AI-powered sign language generation, real-time AI-powered captioning display for events, educational resources, and content translation across languages by leveraging multi-lingual models, adaptive content for differences in learning ability and/or neurodivergence, etc.
  3. AI and creative expression: e.g., generative audio, video, text, and visuals and ways to combine these in a production-oriented direction, including AR/VR/Gaming and OTT implementations.
  4. AI in education: e.g., personalized learning plans, adaptive learning plans, content creation, translation with context, AI tutors, productivity tools, well-being improvement tools, etc.
  5. AI for India: e.g., India-specific law, models that focus on Indic languages, renewable energy optimization, disaster response and relief, and education accessibility.
  6. Additionally, participants can also pick and work on ideas from the list of ideas submitted in this spreadsheet.

Jury - to be announced

Project Evaluation Criteria

Project Evaluation Criteria Presentation

Prizes

Five prizes of ₹1,00,000 (one lakh rupees) each, one per theme, will be awarded to winners at the hackathon.

About The Fifth Elephant

The Fifth Elephant is a community-funded organization. If you like the work that The Fifth Elephant does and want to support meet-ups and activities, online and in-person, contribute by picking up a membership.

Contact information

If you have questions about the hackathon, post a comment here, or join The Fifth Elephant Telegram group and the WhatsApp group.

Follow @fifthel on Twitter.

For any inquiries, call The Fifth Elephant at +91-7676332020.

Sponsored by Meta

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Supported by

Partner

Microsoft for Startups Founders Hub is a digital ecosystem removing barriers to building a company with free access to technology, coaching, and support for founders in any stage of development. Let us accelerate your startup journey from idea-to-exit. Find out more here: https://startups.microsoft…

Soma Dhavala

@dhavala

Yashwardhan Chaudhuri

@chaudhuri contributor

Sai Nikhilesh Reddy

@SaiNikhileshReddy contributor

Project Seshu

Submitted Jan 28, 2024

Introduction

ChatGPT caught the public imagination, and for the first time non-technical people could experience Generative AI. This led to a surge of interest in developing safe applications of LLMs, as well as domain-specific or open-source alternatives to ChatGPT. Notable among them is LLaMA 2, an LLM open sourced by Meta. This release catalyzed the development of tasks, tools, and assets ranging from datasets to new models to new applications. Phi 2, an SLM released by Microsoft, also showed that small models (relatively speaking) can compete with large models while being trained and served at substantially lower cost. However, there are some challenges.

  1. The majority of LLMs we see today, if not all, are based on proven Transformer-based architectures. Transformers have quadratic complexity in the number of input tokens, and are therefore slow to train and to run inference with. As a result, new memory- and compute-efficient attention mechanisms have sprung up, along with engineering hacks. But, at the end of the day, they are still Transformer-based architectures.
  2. The majority, with the exception of some Chinese LLMs, are English-centric, and other languages have only a token representation (no pun intended).
  3. LLMs are often tied to a particular tokenizer, which makes extension to other languages/domains hard.
  4. Developing SLMs or LLMs is still a compute-heavy problem. Therefore, only big corporations with deep pockets, massive talent concentration, and GPU farms can afford to build such models.

In this hackathon, we would like to address the above challenges.

Proposal

Objectives

  1. Develop a multilingual S4-based SLM on the Samanantar dataset
  2. Decentralise the training and development of SLMs/LLMs via a simple federated learning framework
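
For readers new to state space models, the core of an S4-style layer is a linear recurrence over a hidden state. The following is a minimal sketch, assuming a diagonal state matrix and fixed (input-independent) parameters; the function names `ssm_scan`, `ssm_kernel`, and `causal_conv` are illustrative and not taken from any library. Mamba additionally makes some of these parameters depend on the input, which this sketch does not attempt to show.

```python
def ssm_scan(a, b, c, inputs):
    """Run a diagonal linear state space model over a 1-D input sequence.

    State update: x_k[i] = a[i] * x_{k-1}[i] + b[i] * u_k
    Output:       y_k    = sum_i c[i] * x_k[i]
    Cost is linear in sequence length, with a fixed-size state.
    """
    state = [0.0] * len(a)
    outputs = []
    for u in inputs:
        state = [a_i * x_i + b_i * u for a_i, b_i, x_i in zip(a, b, state)]
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs


def ssm_kernel(a, b, c, length):
    """Equivalent convolution kernel: K[j] = sum_i c[i] * a[i]**j * b[i]."""
    return [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(length)
    ]


def causal_conv(kernel, inputs):
    """Causal convolution: y_k = sum_{j<=k} kernel[j] * inputs[k - j]."""
    return [
        sum(kernel[j] * inputs[k - j] for j in range(k + 1))
        for k in range(len(inputs))
    ]


# Impulse response of a single-channel model with decay 0.5.
print(ssm_scan([0.5], [1.0], [1.0], [1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25, 0.125]
```

The recurrent form (`ssm_scan`) and the convolutional form (`ssm_kernel` followed by `causal_conv`) compute the same outputs when the parameters are fixed: the recurrence suits step-by-step inference, while the convolution allows parallel training.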

Deliverables

Phase-1: A multilingual tokenizer-free LLM based on S4

  1. Tokenizer-free: No subword tokenizers. Use byte-level inputs, as in ByT5, so that models can be trained end-to-end. However, byte-level tokens increase the context length, which puts Transformer-based architectures at a disadvantage.
  2. RoPE embeddings: Byte-level tokens (UTF-8 bytes) increase the context length, so we have to deal with extended context lengths via RoPE embeddings (or another approach).
  3. Mamba: S4 (structured state space sequence) models are now competing with Transformer-based models in sequence representation/classification problems. Replace the Transformer architecture with Mamba for efficient training and inference (which addresses multiple problems due to the entanglement of the Transformer architecture with subword tokenizers).
  4. Train the above model on a small multilingual Indic dataset.
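
To make the context-length point in the list above concrete, here is a minimal sketch of byte-level encoding using only the Python standard library. The helper names `byte_encode` and `byte_decode` are illustrative; ByT5 itself also reserves a few IDs for special tokens, which this sketch omits.

```python
def byte_encode(text: str) -> list[int]:
    """Map text to a sequence of UTF-8 byte IDs in the range 0..255."""
    return list(text.encode("utf-8"))


def byte_decode(ids: list[int]) -> str:
    """Invert byte_encode; valid for any complete UTF-8 byte sequence."""
    return bytes(ids).decode("utf-8")


samples = {
    "english": "hello",
    # "namaste" written in Devanagari, given as escapes for portability.
    "hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
}

for name, text in samples.items():
    ids = byte_encode(text)
    # Self-attention scores every position against every other position,
    # so the score matrix grows with the square of the sequence length.
    print(name, "chars:", len(text), "bytes:", len(ids),
          "attention pairs:", len(ids) ** 2)
    assert byte_decode(ids) == text
```

Each Devanagari code point takes three bytes in UTF-8, so the six-code-point Hindi word becomes an 18-byte sequence, while the five-letter English word stays at five bytes. Indic text therefore pays a larger sequence-length cost under byte-level modelling, which is why an architecture that scales linearly with length is attractive here.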

Phase-2: Decentralised development of Indic LLMs

  1. Add LoRA adapters to Mamba (to fine-tune LLMs on modest resources, both data and compute). A few research questions arise here!
  2. Implement a client-server architecture for federated learning (no emphasis on privacy at this time, as the datasets used will be public).
  3. Client side
    3.1 Client downloads the latest pre-trained model and a small dataset, and initialises the adapter
    3.2 Fine-tunes the adapter on a small subset of the data
    3.3 Pushes the adapter back to the hub (which the server can access)
  4. Server side
    4.1 Server issues a client a pre-trained model, an initialised adapter, and a small dataset, depending on the compute budget (FLOPs and time)
    4.2 Server merges the adapters with the base model
    4.3 Does continual pre-training of the base
    4.4 Checkpoints the pre-trained model
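
One round of the client and server steps above can be sketched on a single toy weight matrix. This is a minimal illustration under stated assumptions, not the project's implementation: the proposal does not say how adapters from several clients are combined, so averaging the LoRA deltas is an assumption here, `client_finetune` is a hypothetical stand-in for real local training, and continual pre-training and checkpointing are left out.

```python
import random

Matrix = list[list[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product for small list-of-lists matrices."""
    return [[sum(xi * yj for xi, yj in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_delta(b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """LoRA weight update (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(a)
    return [[alpha / rank * v for v in row] for row in matmul(b, a)]


def init_adapter(d_out: int, d_in: int, rank: int, rng: random.Random):
    """A starts random and B starts at zero, so the initial delta is zero."""
    a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return b, a


def client_finetune(b: Matrix, a: Matrix, rng: random.Random):
    """Hypothetical stand-in for local training: perturbs B with noise."""
    new_b = [[v + rng.gauss(0.0, 0.1) for v in row] for row in b]
    return new_b, a


def server_merge(base: Matrix, adapters, alpha: float) -> Matrix:
    """Assumed policy: average the client deltas and add the mean to the base."""
    merged = [row[:] for row in base]
    for b, a in adapters:
        delta = lora_delta(b, a, alpha)
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += v / len(adapters)
    return merged


base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy 2x3 base weight matrix
b0, a0 = init_adapter(d_out=2, d_in=3, rank=1, rng=random.Random(0))
# Each client starts from the same issued adapter and trains independently.
client_adapters = [client_finetune(b0, a0, random.Random(seed))
                   for seed in (1, 2, 3)]
new_base = server_merge(base, client_adapters, alpha=1.0)
print(new_base)
```

Averaging the deltas rather than the `B` and `A` factors separately keeps the merge exact, because the product of averaged factors is not in general the average of the products. Other merging policies are possible and are among the open questions for this phase.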

This simple, transactional, client-side memoryless federated learning framework democratises the training (and development) of LLMs/SLMs.

Phase-3: First-class citizen of the Hugging Face ecosystem

  1. Mamba supported in HF (say in Transformers library)
  2. SFT and Model merging with PEFT library
  3. Submitting adapters and sharing pre-training checkpoints via HF Hub

Learning Objectives

With community participation (particularly from the student community), we want to understand:

  1. How do S4 models compare against Transformer-based models on dimensions such as scaling laws, compute efficiency, etc.?
  2. How does performance change as a function of model size?
  3. How do training and inference time change as a function of vocabulary size?
  4. How is cross-lingual transfer capability affected as a function of the representation in the data?
  5. What federated learning policy is suitable for distributed training of LLMs w.r.t. data partitions, adapters, and model merging?
  6. Many more questions will come up in due course as we encounter problems and challenges!
