Open datasets for legal AI research and evaluation.

We publish the datasets, benchmarks, and evaluation suites behind JudicialMind's research. They are built for researchers, practitioners, and anyone working on legal information retrieval, reasoning, and language modelling.

Catalogue

Datasets we publish.

All artifacts are hosted on our Hugging Face organisation and can be loaded directly with the datasets library or fetched via the Hub API.

Dataset · Available

JudicialMind Legal Training Dataset

A large-scale, multilingual query-passage corpus for training legal IR and QA systems.

3.69 million annotated query-passage pairs across 35 languages, covering case law, statutes, regulations, and contracts. Rich row-level metadata (query type, legal domain, difficulty, jurisdiction) supports clean train / validation / test partitioning via an A / B / C bucket split.

Pairs · 3.69M
Languages · 35
Size · ~2.6 GB
Parquet shards · 264
Tags · legal, multilingual, retrieval, rag, question-answering, semantic-search, legal-reasoning
License · CC BY-NC-ND 4.0
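
A minimal sketch of how the A / B / C bucket split might be turned into train / validation / test partitions. The column name `bucket` and the mapping of A to train, B to validation, and C to test are assumptions made for illustration; the dataset card is the authority on both.

```python
from typing import Dict, Iterable, List, Mapping

# Assumption: A = train, B = validation, C = test. Verify against the dataset card.
BUCKET_TO_SPLIT = {"A": "train", "B": "validation", "C": "test"}


def partition_by_bucket(
    rows: Iterable[Mapping[str, object]],
    bucket_field: str = "bucket",  # hypothetical column name
) -> Dict[str, List[Mapping[str, object]]]:
    """Group rows into train / validation / test lists by their bucket label."""
    splits: Dict[str, List[Mapping[str, object]]] = {
        name: [] for name in BUCKET_TO_SPLIT.values()
    }
    for row in rows:
        label = row.get(bucket_field)
        split = BUCKET_TO_SPLIT.get(label)
        if split is None:
            raise ValueError(f"Unexpected bucket label: {label!r}")
        splits[split].append(row)
    return splits


# Made-up rows, only to show the shape of the result.
sample = [
    {"query": "q1", "bucket": "A"},
    {"query": "q2", "bucket": "B"},
    {"query": "q3", "bucket": "A"},
    {"query": "q4", "bucket": "C"},
]
parts = partition_by_bucket(sample)
print({name: len(rows) for name, rows in parts.items()})
# {'train': 2, 'validation': 1, 'test': 1}
```

Raising on an unknown label, rather than silently dropping the row, keeps a mislabelled record from leaking out of every partition unnoticed.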
Dataset · Available

India Acts - Central & State Statutes

A comprehensive PDF corpus of Indian legislation, Central and State, in English and Hindi.

12,102 PDF files spanning all 28 States and 8 Union Territories plus Parliament Acts from 1836 to 2025. Consolidated from the India Code portal and individual state-legislature sources. Intended for statutory retrieval, legal QA, summarization, OCR / parsing benchmarks, multilingual legal NLP and citation analysis. Structured by Central vs. State, language, year of enactment and act title for clean navigation.

PDF files · 12,102
Size · ~21.7 GB
Languages · EN / HI
Jurisdictions · 36
Tags · legal, india, acts, statutes, bilingual, government, summarization, retrieval
License · Other (see dataset card)
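
Because the corpus is roughly 21.7 GB of PDFs, it may be useful to fetch only a subset via the Hub API. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`. The repository ID is deliberately left as a required argument (take it from the dataset card), and the example paths are made up: the actual folder layout is not specified here, so treat the keyword matching as an assumption.

```python
from fnmatch import fnmatch
from typing import List, Sequence


def pdf_patterns(keywords: Sequence[str] = ()) -> List[str]:
    """Build glob patterns selecting PDFs whose path contains any keyword."""
    if not keywords:
        return ["*.pdf"]
    return [f"*{kw}*.pdf" for kw in keywords]


def download_acts(repo_id: str, keywords: Sequence[str] = (), local_dir: str = "india_acts") -> str:
    """Download matching PDFs from a dataset repo and return the local folder."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,          # copy the exact ID from the dataset card
        repo_type="dataset",
        allow_patterns=pdf_patterns(keywords),
        local_dir=local_dir,
    )


# Offline preview of what a pattern would select, using hypothetical paths.
example_paths = [
    "central/en/2019/some_act.pdf",
    "state/hi/1950/other_act.pdf",
    "README.md",
]
selected = [
    p for p in example_paths
    if any(fnmatch(p, pat) for pat in pdf_patterns(["2019"]))
]
print(selected)
```

Previewing the patterns against a file listing before downloading is a cheap way to avoid pulling the whole corpus by accident.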
Dataset · Coming soon

Legal Reranking Corpus

Cross-jurisdictional pairwise relevance judgements for reranker training.

Coming soon. A curated set of hard-negative triples for training legal cross-encoders and rerankers, with calibrated relevance labels across case law and statutory passages.

Triples · TBA
Languages · Multi
Tags · legal, reranking, cross-encoder, hard-negatives
License · CC BY-NC-ND 4.0

Usage

Load in one line.

Our datasets are hosted on the Hugging Face Hub and compatible with the datasets library.

from datasets import load_dataset

# The multilingual legal query-passage corpus
ds = load_dataset("judicialmind/legal-training-dataset")

# Inspect a record
print(ds["train"][0])

Need authenticated or streaming access? Use streaming=True or pass a Hub token.
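
A hedged sketch of combining both options. It assumes a recent version of the datasets library, where `load_dataset` accepts a `token` argument (older releases used `use_auth_token`), and that the token is stored in an `HF_TOKEN` environment variable; the helper names are illustrative.

```python
import os
from typing import Any, Dict, Optional

REPO_ID = "judicialmind/legal-training-dataset"  # same ID as in the snippet above


def build_load_kwargs(streaming: bool = True, token: Optional[str] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for datasets.load_dataset."""
    kwargs: Dict[str, Any] = {"streaming": streaming}
    if token:
        # Only pass a token when one is available; public repos do not need it.
        kwargs["token"] = token
    return kwargs


def load_corpus(streaming: bool = True, token: Optional[str] = None):
    """Load the corpus, streaming records instead of downloading all shards."""
    # Imported lazily so the kwargs helper can be used without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, **build_load_kwargs(streaming, token))


def first_record(dataset_dict):
    """Return (split_name, first_row) from the first split of a streamed dataset."""
    split_name, split = next(iter(dataset_dict.items()))
    return split_name, next(iter(split))
```

Calling `first_record(load_corpus(token=os.environ.get("HF_TOKEN")))` should then yield one record without fetching the full ~2.6 GB. Reading the split name from the returned object avoids assuming which split names the dataset exposes.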

License

Released under Creative Commons BY-NC-ND 4.0.

Free for academic, research, and non-commercial use with attribution. The India Acts corpus is licensed separately; see its dataset card. For commercial licensing or derivative rights (including model training for commercial products), contact us at research@judicialmind.ai.

Get involved

Collaborate on the research corpus powering legal AI.

FAQ

Common questions.

Can I use these datasets for commercial purposes?

The JudicialMind Legal Training Dataset is released under CC BY-NC-ND 4.0, which permits academic and non-commercial use with attribution. For commercial licensing, contact research@judicialmind.ai.

What format are the datasets in?

The legal training dataset ships as Parquet shards, loadable directly with the Hugging Face datasets library. The India Acts corpus is a collection of original PDF files.

How often are datasets updated?

Updates ship as new versions on Hugging Face. The training dataset is versioned by release; the India Acts corpus is updated as new legislation is published. Follow the org on Hugging Face for notifications.

Can I contribute annotations or corrections?

Yes. Open an issue or discussion on the relevant Hugging Face dataset page, or email research@judicialmind.ai with the details.