Session on "Use Cases and Risks of ML in Capital Markets" | 23 December at 4 PM. Hi everyone! The AI and Risk Mitigation project is well underway, and for the third session we will be joined by Rachna Maheshwari, Associate Director at CRI…
The 2023 Monsoon edition is curated by:
- Nischal HP, Vice President of Data Engineering and Data Science at Scoutbee. Nischal curated the MLOps conference which was held online between 23 and 27 July 2021.
- Sumod Mohan, Founder and CEO at AutoInfer. Sumod curated Anthill Inside 2019 edition, held in Bangalore on 23 November.
- AI and Research - covers research, findings, and solutions for challenges in building models in areas such as fraud detection, forecasting, and analytics. This track delves into the latest methodologies for handling large-scale data processing, distributed computing, and model performance optimization.
- Industrial applications of ML - covers implementation of AI in industry, with more focus on the AI models themselves, the issues in training, gathering data, and so forth. ML is being used at scale in industries such as automotive, mechanical, manufacturing, and agriculture. This track focuses on the challenges in this space, as we see innovation coming out of these industries in the pursuit of using ML on a second-to-second basis.
- AI and Product - covers strategies for building AI products to scale and mitigating challenges. This track provides insights on incorporating AI tools and forecasting techniques to improve model training, developing a working model architecture, and using data in the business context.
There are three phases in the lifecycle of an application - research, application, and what comes after deployment:
- Assess capabilities, determining the new frontiers for AI.
- Find a use for the application.
- Learn how to run it, monitor it and update it with time.
The three tracks at the 2023 Monsoon edition of The Fifth Elephant will cover this lifecycle.
The Fifth Elephant 2023 Monsoon edition will be held in-person. Attendance is open to The Fifth Elephant members only. Purchase a membership to attend the conference in-person. If you have questions about participation, post a comment here.
- Data/MLOps engineers who want to learn about state-of-the-art tools and techniques, especially from domains such as automobile, agri-tech and mechanical industries.
- Data scientists who want a deeper understanding of model deployment/governance.
- Architects who are building ML workflows that scale.
- Tech founders who are building products that require AI or ML.
- Product managers who want to learn about the process of building AI/ML products.
- Directors, VPs and senior tech leadership who are building AI/ML teams.
Sponsorship slots are open for:
- Infrastructure (GPU, CPU and cloud providers) and developer productivity tool makers who want to evangelise their offering to developers and decision-makers.
- Companies seeking tech branding among AI and ML developers.
- Venture Capital (VC) firms and investors who want to scan the landscape of innovations and innovators in AI, and to source leads for investment in the AI and ML space.
Building efficient and secure vector data workflows
Large Language Models (LLMs) have demonstrated amazing capability for solving complex problems. But they can't answer what they haven't seen, and to take advantage of these models, we need to expose our data to them. Fine-tuning is not an option, at least not a cheap one. Prompt engineering is a helpful technique for providing context to an LLM, which restricts the model's answer to that context.
To use context in real-time queries, we need to process and store our documents beforehand. This means vectorizing them and storing the vectors in a vector database. This works well if we are just querying for similar text, but production use cases are more complex. Your application data sits in a structured relational database like Postgres, and your vector data in a vector store like Pinecone. Data fragmentation is always a hard problem to solve.
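To make the vectorize-then-query flow concrete, here is a minimal, self-contained sketch. A toy character-frequency "embedding" stands in for a real embedding model, and an in-memory list stands in for a vector database; both are illustrative assumptions, not any particular product's API.

```python
import math

def embed(text: str) -> list[float]:
    # Toy "embedding": normalized character-frequency vector.
    # A real pipeline would call an embedding model here instead.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    counts = [text.lower().count(c) for c in alphabet]
    norm = math.sqrt(sum(v * v for v in counts)) or 1.0
    return [v / norm for v in counts]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are pre-normalized, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

class VectorStore:
    """Minimal in-memory stand-in for a vector database."""

    def __init__(self):
        self.items: list[tuple[str, list[float]]] = []

    def add(self, doc: str) -> None:
        # Vectorize at write time, so queries can run in real time later.
        self.items.append((doc, embed(doc)))

    def query(self, text: str, k: int = 1) -> list[str]:
        q = embed(text)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]

store = VectorStore()
store.add("refund policy for damaged goods")
store.add("shipping times for international orders")
print(store.query("how do refunds work"))
```

Swapping the toy `embed` for a real model and `VectorStore` for an actual vector DB gives the same shape of workflow: vectorize on write, rank by similarity on read.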
Additionally, semantic search is not a replacement for conditional queries. Price < $20 is not the same as Price > $20. Yet the two will come out as highly similar, because conceptually they are similar: each is a comparison statement over a price value.
In some cases, we may also be required to query vectors with predicate filtering, i.e. filtering similarity search on a subset of documents.
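One way to picture predicate filtering is to restrict the candidate set with a structured condition first and only then rank the surviving subset by similarity. The documents, metadata fields, and two-dimensional vectors below are invented for illustration; a real vector DB would push the filter into its query API.

```python
import math

# Each document carries structured metadata alongside its (toy) embedding.
docs = [
    {"id": 1, "category": "shoes", "price": 15.0, "vec": [0.9, 0.1]},
    {"id": 2, "category": "shoes", "price": 45.0, "vec": [0.8, 0.2]},
    {"id": 3, "category": "bags",  "price": 12.0, "vec": [0.1, 0.9]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filtered_search(query_vec, predicate, k=1):
    # Predicate filtering: similarity search runs only over the subset of
    # documents that satisfy the structured condition.
    subset = [d for d in docs if predicate(d)]
    subset.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in subset[:k]]

# "shoes similar to this one, but strictly under $20"
print(filtered_search([1.0, 0.0],
                      lambda d: d["category"] == "shoes" and d["price"] < 20))
```

The conditional part (`price < 20`) is exact, which is precisely what a pure semantic search cannot guarantee.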
This is not a solved problem; some vector DBs provide an additional metadata store to run such queries as if the data were structured, but it isn't the same as your data model sitting in a relational DB like Postgres. And if data fragmentation wasn't enough of a problem, we are now talking about data duplication. And what about failures and data states going out of sync?
And if all of my application data - my structured data - sits in a relational database and my contextual data in a vector database, how can I securely access the vector data with role-based access control and authorization when all of the data for a role sits in the relational database?
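As a hypothetical sketch of one way to bridge the two stores: role assignments live on the relational side, documents carry an allowed-roles attribute, and retrieval filters on it before any vectors are scored. All names here are invented for illustration; the similarity ranking step is omitted to keep the access-control idea in focus.

```python
# Relational side: which role each user holds (in reality, a users/roles table).
ROLE_OF_USER = {"alice": "hr", "bob": "employee"}

# Vector side: each document records which roles may read it.
DOCUMENTS = [
    {"text": "public holiday calendar", "allowed_roles": {"employee", "hr"}},
    {"text": "salary bands by level",   "allowed_roles": {"hr"}},
]

def accessible_docs(user: str) -> list[str]:
    # Look up the role on the relational side, then filter the vector-side
    # documents; similarity search would run only over this subset.
    role = ROLE_OF_USER[user]
    return [d["text"] for d in DOCUMENTS if role in d["allowed_roles"]]

print(accessible_docs("bob"))    # an employee sees only public documents
print(accessible_docs("alice"))  # HR sees both
```

The hard part in practice is keeping the role data and the vector data consistent across two systems, which is exactly the gap described above.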
I will motivate the talk by walking through the different kinds of queries one runs against vector DBs.
We will also get a sense of popular vector DBs like Weaviate, Pinecone, and Milvus, to understand the capabilities they support.
We will understand how Hasura's Data API addresses these problems by providing an abstraction over different data stores, giving the capability to query with remote joins as if it were all in the same store, along with authentication and role-based access control - a big gap in current vector DB workflows.
Additionally, we will walk through how features like event triggers can help set up auto-vectorization on insert/update/delete events in no time, without your vectorization workflow getting tied to a specific vector DB.
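As a sketch, the webhook behind such an event trigger might look like the following. The payload shape mirrors Hasura's documented event format (`event.op`, `event.data.old`/`new`); the dict-based store and the toy word-count "embedding" are stand-ins for a real vector DB and vectorization step.

```python
def handle_event(payload: dict, vector_store: dict) -> None:
    # On INSERT/UPDATE, (re-)vectorize the changed row; on DELETE, drop it.
    op = payload["event"]["op"]
    data = payload["event"]["data"]
    row = data["new"] or data["old"]  # DELETE events carry only "old"
    if op in ("INSERT", "UPDATE"):
        vector_store[row["id"]] = {
            "text": row["body"],
            "vec": [len(row["body"].split())],  # toy stand-in for an embedding
        }
    elif op == "DELETE":
        vector_store.pop(row["id"], None)

store = {}
handle_event(
    {"event": {"op": "INSERT",
               "data": {"old": None,
                        "new": {"id": 1, "body": "hello vector world"}}}},
    store,
)
print(store[1]["text"])
```

Because the handler only sees a generic payload, swapping the vector DB means changing one write path, not the whole workflow.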
The superpowers of authentication and RBAC can also be applied to your LLM calls using remote actions in Hasura. This can be extended by creating custom prompts and enforcing RBAC over them. For example, an LLM query like "summarise public data" can be open to all, while "summarise employee data" is open only to managers and HR.
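The prompt-level RBAC idea can be sketched as a lookup from prompt name to allowed roles, checked before the model is ever called. `PROMPT_ROLES`, `call_llm`, and `run_prompt` are hypothetical names; in Hasura, this check would sit behind a remote action rather than in application code.

```python
# Which roles may execute each named prompt (mirroring the example above).
PROMPT_ROLES = {
    "summarise_public_data": {"employee", "manager", "hr"},
    "summarise_employee_data": {"manager", "hr"},
}

def call_llm(prompt_name: str) -> str:
    # Stand-in for the actual LLM call behind a remote action.
    return f"[LLM output for {prompt_name}]"

def run_prompt(user_role: str, prompt_name: str) -> str:
    # Gate the LLM call on the caller's role; unknown prompts are denied.
    if user_role not in PROMPT_ROLES.get(prompt_name, set()):
        raise PermissionError(f"role '{user_role}' may not run '{prompt_name}'")
    return call_llm(prompt_name)

print(run_prompt("hr", "summarise_employee_data"))  # allowed for HR
```

An `employee` calling `run_prompt("employee", "summarise_employee_data")` would be rejected before any model call happens, which is the point: authorization is enforced on the prompt, not left to the model.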