Open Legal AI Datasets and Benchmarks | JudicialMind

Catalogue

Datasets we publish.

All artifacts are hosted on our Hugging Face organisation and can be loaded directly with datasets or fetched via the Hub API.
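As a minimal sketch of both access routes, the snippet below streams rows with the `datasets` library and downloads raw files through the Hub API. The repository IDs are hypothetical placeholders (copy the real ones from the dataset cards), and the `train` split name and the `*.pdf` pattern are assumptions rather than documented values.

```python
from itertools import islice

# Hypothetical placeholders: replace with the repository IDs shown on the dataset cards.
TRAINING_REPO = "<org>/<legal-training-dataset>"
ACTS_REPO = "<org>/<india-acts>"


def stream_training_rows(repo_id, split="train"):
    """Stream rows lazily instead of downloading every Parquet shard up front.

    The split name "train" is an assumption; check the dataset card.
    """
    from datasets import load_dataset  # optional dependency: pip install datasets

    return load_dataset(repo_id, split=split, streaming=True)


def fetch_pdfs(repo_id, pattern="*.pdf"):
    """Download only files matching `pattern` and return the local snapshot folder."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=[pattern],
    )


def peek(rows, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(rows, n))
```

Once a real repository ID is substituted, `peek(stream_training_rows(TRAINING_REPO))` would show a few rows without fetching the full corpus, and `fetch_pdfs(ACTS_REPO)` would pull the PDF files into the local Hub cache.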

Dataset · Available

JudicialMind Legal Training Dataset

A large-scale, multilingual query-passage corpus for training legal IR and QA systems.

3.69 million annotated query-passage pairs across 35 languages, covering case law, statutes, regulations, and contracts. Rich row-level metadata (query type, legal domain, difficulty, jurisdiction) supports clean train / validation / test partitioning via an A / B / C bucket split.
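The bucket-based partitioning described above could look roughly like the following sketch. The `bucket` field name and the mapping of A / B / C to train / validation / test are assumptions made for illustration; the dataset card is the authority on the real column name and on which bucket plays which role.

```python
from collections import defaultdict

# Assumption: each row has a "bucket" field holding "A", "B" or "C",
# and the buckets map to train / validation / test in that order.
BUCKET_TO_SPLIT = {"A": "train", "B": "validation", "C": "test"}


def partition_by_bucket(rows, bucket_field="bucket"):
    """Group rows into split lists according to their bucket label."""
    splits = defaultdict(list)
    for row in rows:
        bucket = row.get(bucket_field)
        split_name = BUCKET_TO_SPLIT.get(bucket)
        if split_name is None:
            # Fail loudly rather than silently dropping rows with unknown labels.
            raise ValueError(f"unexpected bucket value: {bucket!r}")
        splits[split_name].append(row)
    return dict(splits)


# Toy rows standing in for real records (field names are illustrative only).
toy_rows = [
    {"query": "q1", "bucket": "A"},
    {"query": "q2", "bucket": "B"},
    {"query": "q3", "bucket": "A"},
    {"query": "q4", "bucket": "C"},
]
splits = partition_by_bucket(toy_rows)
print({name: len(items) for name, items in splits.items()})
```

Splitting on a precomputed label rather than sampling at random keeps the partition identical across runs and across teams, which is the usual reason for shipping such a field with the data.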

3.69M pairs · 35 languages · ~2.6 GB · 264 Parquet shards

Tags: legal, multilingual, retrieval, rag, question-answering, semantic-search, legal-reasoning

License · CC BY-NC-ND 4.0

Dataset · Available

India Acts - Central & State Statutes

A comprehensive PDF corpus of Indian legislation, Central and State, in English and Hindi.

12,102 PDF files spanning all 28 States and 8 Union Territories plus Parliament Acts from 1836 to 2025. Consolidated from the India Code portal and individual state-legislature sources. Intended for statutory retrieval, legal QA, summarization, OCR / parsing benchmarks, multilingual legal NLP and citation analysis. Structured by Central vs. State, language, year of enactment and act title for clean navigation.

12,102 PDF files · ~21.7 GB · EN / HI · 36 jurisdictions

Tags: legal, india, acts, statutes, bilingual, government, summarization, retrieval

License · Other (see dataset card)

Dataset · Coming soon

Legal Reranking Corpus

Cross-jurisdictional pairwise relevance judgements for reranker training.

Coming soon. A curated set of hard-negative triples for training legal cross-encoders and rerankers, with calibrated relevance labels across case law and statutory passages.

Triples: TBA · Languages: multiple

Tags: legal, reranking, cross-encoder, hard-negatives

License · CC BY-NC-ND 4.0

License

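Released under Creative Commons BY-NC-ND 4.0 unless the dataset card states otherwise; the India Acts corpus carries its own license, given on its card.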

Free for academic, research, and non-commercial use with attribution. For commercial licensing or derivative rights (including model training for commercial products), contact us at research@judicialmind.ai.

Get involved

Collaborate on the research corpus powering legal AI.


FAQ

Common questions.

Can I use these datasets for commercial purposes?

The primary corpus is released under CC BY-NC-ND 4.0, which permits academic and non-commercial use with attribution. For commercial licensing, contact research@judicialmind.ai.

What format are the datasets in?

The legal training dataset ships as Parquet shards, loadable directly with the Hugging Face datasets library. The India Acts corpus is a collection of original PDF files.

How often are datasets updated?

Updates ship as new versions on Hugging Face. The training dataset is versioned by release; the India Acts corpus is updated as new legislation is published. Follow the org on Hugging Face for notifications.

Can I contribute annotations or corrections?

Yes. Open an issue or discussion on the relevant Hugging Face dataset page, or email research@judicialmind.ai with the details.

Open datasets for legal AI research and evaluation.
