An LLM-enabled XML generator for Indian statutes and laws in the Akoma Ntoso format.
While originally meant as a compliance mapper, this project's author, guided by the Unix philosophy of "doing one thing and doing it well", has decided to focus on building something more modular rather than on a single use case. Accordingly, repository details that applied only to the previous approach have been struck out. Most of the progress made, however, carries over to the project's new goal.
Click here to view progress made so far
Further utilisation of machine-readable laws necessitates converting entire statutes into the set of formats known as "Akoma Ntoso". Several separate solutions locatable on the web/GitHub/GitLab currently help perform schema generation in different local contexts.
Hence, there is a need for a comprehensive automated solution for Indian laws that is open source, up to date with AI capabilities, and covers both LegalDocML and LegalRuleML.
Since LegalDocML and LegalRuleML exist to encode legal statutes in machine-readable markup, the app shall generate XML output. I will do so by using an LLM-based approach to generate an entire statute's XML.
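A minimal sketch of what the generation step could look like, assuming an OpenAI-style chat API. The system prompt and helper names below are illustrative assumptions, not the project's settled interface:

```python
# Illustrative prompt and post-processing helpers (names are assumptions).
SYSTEM_PROMPT = (
    "You are a legal-markup assistant. Convert the statute text you are "
    "given into Akoma Ntoso 3.0 XML. Output only the XML document."
)

def build_messages(statute_text: str) -> list:
    """Package a statute's plain text as chat-completion messages."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": statute_text},
    ]

def extract_xml(reply: str) -> str:
    """Strip an optional Markdown code fence from a model reply."""
    reply = reply.strip()
    if reply.startswith("```"):
        lines = reply.splitlines()
        # Drop the opening fence line, and the closing fence if present.
        lines = lines[1:-1] if lines[-1].strip() == "```" else lines[1:]
        reply = "\n".join(lines)
    return reply.strip()

# The real call would look roughly like (needs an API key):
# resp = client.chat.completions.create(model="gpt-3.5-turbo",
#                                       messages=build_messages(text))
# xml = extract_xml(resp.choices[0].message.content)
```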
- Users can upload text files/PDFs and download Akoma Ntoso-compliant output in XML format.
- Users will be able to choose among OpenAI, prompt engineering, RAG, or a fine-tuned model, since each generates different output and has different inference costs and resource requirements.
- Users will be able to validate the generated XML, as well as browse and query it in a web interface, using open-source integrations (discussed in the section below).
- Further modularity is likely to be added as the project progresses.
- The app shall display a disclaimer before execution and alongside the generated results in each case, stating that the results do not constitute legal advice.
- No user data will be sought or stored anywhere; the database integration will store only the inference results for each statute.
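For the validation feature, a cheap first pass can be done with the standard library before handing off to a full schema validator. The sketch below (function name is my own) only checks well-formedness and the Akoma Ntoso 3.0 root element; full validation would use the official AKN XSD, e.g. via `lxml`:

```python
import xml.etree.ElementTree as ET

# Official Akoma Ntoso 3.0 namespace.
AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

def check_akn(xml_text: str) -> bool:
    """Sanity check: well-formed XML with an akomaNtoso root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == f"{{{AKN_NS}}}akomaNtoso"

sample = f'<akomaNtoso xmlns="{AKN_NS}"><act/></akomaNtoso>'
```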
I am working in a cloud CPU notebook and attempting to work with quantised models at this stage. I will expand capacity later, if required and for fine-tuning.
For the app showcase, I intend to deploy via a cloud GPU to the extent possible.
HuggingFace's Inference API will be used, in addition to Azure or another comparable compute provider.
- Llama (useful because of its Grammars implementation)
I have been able to generate comparable output from Llama's 13B and 7B models via few-shot prompting.
I am using GPT-3.5 for some idea testing, and it has delivered consistent results so far. These are shared in my notebook on the GitHub repository.
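The few-shot prompting mentioned above can be sketched as below. The worked example pair is invented for illustration and is not from the project's actual prompt set:

```python
# One invented input/output pair demonstrating the target markup.
EXAMPLE_IN = "1. Short title. This Act may be called the Demo Act, 2024."
EXAMPLE_OUT = (
    '<section eId="sec_1"><num>1.</num><heading>Short title.</heading>'
    '<content><p>This Act may be called the Demo Act, 2024.</p></content>'
    '</section>'
)

def few_shot_messages(statute_text: str) -> list:
    """A worked example followed by the real input, chat-completions style."""
    return [
        {"role": "system", "content": "Convert statute text to Akoma Ntoso XML."},
        {"role": "user", "content": EXAMPLE_IN},
        {"role": "assistant", "content": EXAMPLE_OUT},
        {"role": "user", "content": statute_text},
    ]
```

The same message list works for GPT-3.5 via the chat API and, flattened into a single prompt string, for the Llama models.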
The following models will be considered later, if required:
- Mixtral 7b instruct fine tuned
- BERT models
- LegalBERT (however, this paper suggests that auto-encoding models perform worse than autoregressive ones on this task)
- InLegalBERT (shown to perform better on Indian laws)
Code and algorithms from previous open-source projects can be reused, such as the preprocessing and transformation code from the "judgmentstoAKN" project. Further available integrations are listed below,
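As a sketch of the kind of preprocessing step that could be reused or rewritten, the snippet below splits a statute's plain text into numbered sections (the regex is my own assumption about the input layout, not code from "judgmentstoAKN"):

```python
import re

# Indian statutes typically number sections "1.", "2.", "2A." at the
# start of a line; this pattern is an illustrative assumption.
SECTION_RE = re.compile(r"(?m)^(\d+[A-Z]?)\.\s+")

def split_sections(text: str):
    """Return (section number, section text) pairs."""
    parts = SECTION_RE.split(text)
    # parts = [preamble, num1, body1, num2, body2, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]

sample = "1. Short title. This Act...\n2. Definitions. In this Act...\n"
```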
The project will be integrated with XML schema viewers (at least one of the below):
Grammar constraining options
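One such option is llama.cpp's GBNF grammars (the "Grammars implementation" noted for Llama above), which constrain decoding so the model can only emit tokens that match a grammar. A minimal illustrative fragment, with an invented element subset rather than the full AKN vocabulary:

```gbnf
root    ::= "<akomaNtoso xmlns=\"http://docs.oasis-open.org/legaldocml/ns/akn/3.0\">" act "</akomaNtoso>"
act     ::= "<act><body>" section+ "</body></act>"
section ::= "<section eId=\"sec_" num "\"><num>" num "</num><content><p>" text "</p></content></section>"
num     ::= [0-9]+
text    ::= [^<>]+
```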
- In-context learning prompts, datasets for training, and other relevant details will be shared as they are generated.
- Integrations will be considered as the app proceeds in development, including a PostgreSQL database integration for storing the generated schemas.
- Limitations of automated legal-to-XML markup have been observed here
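As an illustration of what the PostgreSQL integration mentioned above might store, a hypothetical table (the table and column names are assumptions, not a settled schema):

```sql
-- Hypothetical table for caching per-statute inference results.
CREATE TABLE IF NOT EXISTS generated_xml (
    id          SERIAL PRIMARY KEY,
    statute     TEXT NOT NULL,             -- statute title or identifier
    model       TEXT NOT NULL,             -- e.g. 'gpt-3.5-turbo', 'llama-13b'
    akn_xml     XML  NOT NULL,             -- generated Akoma Ntoso document
    created_at  TIMESTAMPTZ DEFAULT now()
);
```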