Schematise (formerly, "Complianalyse") – The Fifth Elephant Open Source AI Hackathon 2024

Dec 2023

18 Mon

19 Tue 05:30 PM – 06:30 PM IST

20 Wed

21 Thu

22 Fri

23 Sat

24 Sun

Jan 2024

1 Mon

2 Tue

3 Wed

4 Thu

5 Fri 05:30 PM – 07:20 PM IST

6 Sat

7 Sun

Jan 2024

8 Mon 06:00 PM – 06:55 PM IST

9 Tue

10 Wed 06:00 PM – 07:00 PM IST

11 Thu

12 Fri 06:00 PM – 07:30 PM IST

13 Sat 03:00 PM – 06:00 PM IST

14 Sun

Jan 2024

22 Mon

23 Tue

24 Wed

25 Thu

26 Fri

27 Sat 05:00 PM – 05:45 PM IST

28 Sun

Feb 2024

29 Mon

30 Tue

31 Wed

1 Thu

2 Fri

3 Sat 10:00 AM – 06:25 PM IST

4 Sun

Feb 2024

5 Mon

6 Tue

7 Wed 08:15 PM – 09:00 PM IST

8 Thu

9 Fri

10 Sat

11 Sun

Feb 2024

12 Mon 08:15 PM – 09:00 PM IST

13 Tue 08:15 PM – 09:00 PM IST

14 Wed 08:15 PM – 09:00 PM IST

15 Thu 08:15 PM – 09:00 PM IST

16 Fri 07:30 PM – 08:30 PM IST

17 Sat 08:15 PM – 09:00 PM IST

18 Sun

Feb 2024

19 Mon

20 Tue

21 Wed 08:30 PM – 09:15 PM IST

22 Thu

23 Fri

24 Sat

25 Sun

Mar 2024

4 Mon

5 Tue

6 Wed

7 Thu

8 Fri

9 Sat 07:00 PM – 09:00 PM IST

10 Sun 04:00 PM – 06:00 PM IST

Apr 2024

8 Mon

9 Tue

10 Wed

11 Thu

12 Fri 12:00 PM – 06:25 PM IST

13 Sat

14 Sun

Hasura, Bangalore

This submission has been added to the schedule

Schematise (formerly, "Complianalyse")

Submitted Jan 27, 2024

Category: AI for legal

An LLM-enabled XML generator for Indian statutes and laws in the Akoma Ntoso format.

While this project was originally meant as a compliance mapper, its author, guided by the Unix philosophy of “Doing one thing and doing it well”, has decided to create something more modular rather than focus on a single use case. Accordingly, the repository details have been amended, with text struck out where it applied only to the previous approach. However, most of the progress made so far applies to the project's new goal.

Link to the README on the GitHub repository (containing the Roadmap and TO-DOs)

https://github.com/sankalpsrv/Schematise/blob/main/README.md

Go to the Dev branch (click here) to view more frequent updates.

Click here to view progress made so far

Problem

Further utilisation of machine-readable laws requires converting entire statutes into a set of formats known as “Akoma Ntoso”. Currently, the following separate solutions can be found on the web, GitHub, or GitLab; each performs schema generation in a different local context.

Indigo platform, an open-source text editor for manually annotating any document;
a partially open source project for automating statute conversion to LegalDocML format - Bungeni;
and an open-sourced schema generator for Greek Judgments and Opinions.

Hence, there is a need for a comprehensive, automated solution for Indian laws that is open source, up to date with AI capabilities, and covers both LegalDocML and LegalRuleML.

Proposed solution

Since LegalRuleML and LegalDocML already exist as standards for encoding legal statutes as structured text, the app shall generate XML output in these formats. I will do so using an LLM-based approach to generate an entire statute’s XML.

Features

Users can upload text files or PDFs and download Akoma Ntoso-compliant output in XML format.
Users will be able to choose between OpenAI, prompt engineering, RAG, or a fine-tuned model, since each can generate different outputs and has different inference costs and resource requirements.
Users will be able to validate the generated XML, as well as browse and query it in a web interface, using open-source integrations (discussed in the section below).
Further modularity is likely to be added as the project progresses.
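
To make the first feature more concrete, here is a minimal sketch of wrapping plain statute sections in an Akoma Ntoso skeleton using Python's standard library. The `build_act` helper, the section tuples, and the sample text are hypothetical and are not taken from the Schematise repository. The output is a simplified skeleton: it omits the `meta` block and other parts that a schema-valid Akoma Ntoso document requires.

```python
import xml.etree.ElementTree as ET

# Namespace of Akoma Ntoso 3.0 (OASIS LegalDocML).
AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"


def build_act(sections):
    """Wrap (number, heading, text) tuples in a simplified Akoma Ntoso act.

    Hypothetical helper for illustration; not schema-valid on its own.
    """
    # Serialise the AKN namespace as the default namespace (no prefix).
    ET.register_namespace("", AKN_NS)
    root = ET.Element(f"{{{AKN_NS}}}akomaNtoso")
    act = ET.SubElement(root, f"{{{AKN_NS}}}act")
    body = ET.SubElement(act, f"{{{AKN_NS}}}body")
    for number, heading, text in sections:
        section = ET.SubElement(
            body, f"{{{AKN_NS}}}section", {"eId": f"sec_{number}"}
        )
        ET.SubElement(section, f"{{{AKN_NS}}}num").text = f"{number}."
        ET.SubElement(section, f"{{{AKN_NS}}}heading").text = heading
        content = ET.SubElement(section, f"{{{AKN_NS}}}content")
        ET.SubElement(content, f"{{{AKN_NS}}}p").text = text
    return ET.tostring(root, encoding="unicode")


# Made-up sample input, used only to show the call shape.
xml_text = build_act(
    [("1", "Short title", "These rules may be called the Example Rules.")]
)
```

In the proposed app, the LLM would produce the markup; a deterministic builder like this one is only useful as a fallback or as a way to produce few-shot examples.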

Progress

Tested the few-shot in-context learning approach via LangChain and OpenAI. Due to token limits in the LangChain workflow, a split approach was necessary.
Nevertheless, LegalRuleML code was generated via this approach; it is available in this notebook (click here).
The test output generated for the entire Bio-medical Waste Rules is here: testbmw.txt
This was generated using the script in the “src” folder of this repository (click here).
Tested Llama through LangChain and AzureML Endpoints; it works via Azure, while LangChain presents some difficulties (click here to view notebook).
Tested Llama2-7b-chat on few-shot prompting using the examples given in the LegalRuleML documentation, and found that it produces better output.
Wrote a set of functions that compare similarity scores for the XML output generated sequentially. This will help identify places where there is overlap in the case of few-shot prompting.
Wrote a RAG approach, using the metamodel and descriptions of the schema, that can be used for tweaking the LegalRuleML generated earlier.
Opened a dev branch, which contains work on putting together the developed components. So far, the Azure Machine Learning Endpoints approach has been uploaded; more will be added with local deployment.
Tested various models and their quantisations for context windows; the GPTQ quantised version accepts a larger context, and the output is uploaded in a notebook.
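
The split approach and the similarity comparison mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code in the repository: `split_into_chunks` and `adjacent_similarity` are hypothetical names, word counts stand in for model tokens, and `difflib.SequenceMatcher` is a swapped-in similarity measure chosen because it ships with Python.

```python
from difflib import SequenceMatcher


def split_into_chunks(text, max_words=200, overlap=20):
    """Split text into word-based chunks that share `overlap` words.

    Assumption: word counts are a rough stand-in for model tokens.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def adjacent_similarity(fragments):
    """Return a similarity ratio in [0, 1] for each consecutive pair."""
    return [
        SequenceMatcher(None, first, second).ratio()
        for first, second in zip(fragments, fragments[1:])
    ]
```

A high ratio between two consecutive generated fragments suggests that the model repeated markup across a chunk boundary, which is where deduplication would be needed.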

Ethical considerations

The app shall display a disclaimer before executing and alongside the generated results in each case, stating that the results do not constitute legal advice.
No user data will be sought or stored anywhere. The database integration will store the inference results for each statute.

Resource constraints

I am working on a Cloud GPU when testing “local inferencing” via Llama2, and I am attempting to work with quantised models at this stage. However, OpenAI seems to provide a much more feasible deployment scenario. I will expand capacity later if required, including for fine-tuning, for example by using Azure.

For the app showcase, I intend to deploy via Cloud GPU to the extent possible. Alternatively, the app will run on OpenAI, as it is already integrated in the LangChain workflow.

LLMs being compared

HuggingFace’s Inference API will be used, in addition to Azure or any other comparable compute provider.

Llama (useful because of its Grammars implementation)
I have been able to generate similar output from Llama’s 13b and 7b models via few-shot prompting.
GPT
I am using GPT-3.5 for some idea testing, and it has been delivering results consistently so far. I have shared these in my notebook on the GitHub repository.
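
Few-shot prompting, as used with both model families above, amounts to placing worked examples before the new input. The sketch below shows one way to assemble such a prompt as plain text. The `build_few_shot_prompt` name, the instruction wording, and the placeholder examples are assumptions made for illustration; the actual prompts live in the project's notebooks.

```python
def build_few_shot_prompt(examples, statute_text):
    """Assemble a few-shot prompt from (legal text, XML) example pairs.

    Hypothetical helper; the instruction wording is an assumption.
    """
    parts = ["Convert the legal text into LegalRuleML XML.", ""]
    for source, xml in examples:
        parts += [f"Text: {source}", f"XML: {xml}", ""]
    # The final, unanswered pair is the one the model should complete.
    parts += [f"Text: {statute_text}", "XML:"]
    return "\n".join(parts)


# Placeholder pairs; real examples would come from the LegalRuleML documentation.
examples = [
    ("<example provision 1>", "<example XML 1>"),
    ("<example provision 2>", "<example XML 2>"),
]
prompt = build_few_shot_prompt(examples, "<provision to convert>")
```

The resulting string can be sent to any of the models being compared, which keeps the comparison fair because every model sees the same examples.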

The following models will be considered later, if required

Mixtral 7b instruct fine tuned
BERT models
- LegalBert (however, this paper suggests that auto-encoding models perform worse than autoregressive ones on this task)
- InLegalBert (shown to perform better on Indian laws)

Open-source integrations

Code and algorithms from previous open-source projects can be reused, such as the preprocessing and transformation code from the “judgmentstoAKN” project. Further available integrations are listed below.
The project will be integrated with XML schema viewers (at least one of the below):
- Online-xsd-viewer or any other comparable general-purpose browser
- Bungenix’s Akoma Ntoso Schema browser
- LRML Search - for performing many handy functions including storing and viewing the LRML schema
Grammar constraining options
- Java class representation for the Akoma Ntoso
- the ANTLR rule to JSON converter provided by Bungeni can possibly be leveraged for generating Llama Grammars
Validation options
- Akoma Ntoso Subschema Generator, which allows users to specify constraint rules, ensuring consistency with what they have defined earlier.
- Bungenix’s Schema context viewer - for programmatically validating any generated schema against AKN version 3.0
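
Before handing output to one of the validation tools listed above, a lightweight pre-check can catch the most common failures of LLM-generated XML. The sketch below is an assumption-laden illustration, not part of any listed tool: `check_generated_xml` is a hypothetical helper that only tests well-formedness and namespace membership, and it does not replace validation against the Akoma Ntoso XSD.

```python
import xml.etree.ElementTree as ET

# Namespace of Akoma Ntoso 3.0 (OASIS LegalDocML).
AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"


def check_generated_xml(xml_text):
    """Return a list of problems found in generated XML (empty if none)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Truncated or unbalanced output from the model ends up here.
        return [f"not well-formed: {exc}"]
    problems = []
    if root.tag != f"{{{AKN_NS}}}akomaNtoso":
        problems.append(f"unexpected root element: {root.tag}")
    for element in root.iter():
        if not element.tag.startswith(f"{{{AKN_NS}}}"):
            problems.append(f"element outside the AKN namespace: {element.tag}")
    return problems
```

An empty list means only that the document parsed and stayed within the namespace; structural rules such as required children still need a schema-aware validator.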

Other relevant details

In-context learning prompts, training datasets, and other relevant details will be shared as they are generated.
Integrations, including a PostgreSQL database integration for storing the generated schemas, will be considered as development of the app proceeds.
Limitations for automated Legal to XML markup have been observed here

Hybrid access (members only)

Hosted by

Hack Five

The Fifth Elephant hackathons

Supported by

Host

The Fifth Elephant

Jump starting better data engineering and AI futures

Meta

Venue host

Hasura

Welcome to the events page for events hosted at The Terrace @ Hasura.

Partner

Microsoft for Startups

Providing all founders, at any stage, with free resources to build a successful startup.