These datasets were designed for privacy/anonymization but contain high-quality annotations for entity types like phone numbers, email addresses, and physical addresses that are underrepresented in traditional NER datasets.
| Attribute | Detail |
|---|---|
| Source | Ai4Privacy project |
| Size | 200k to 500k samples (multiple versions) |
| PII classes | 54 types (200k version), 27 types in OpenPII subset |
| Domain coverage | Business, education, psychology, legal |
| License | Open |
| Access | `load_dataset("ai4privacy/pii-masking-200k")` — HuggingFace |
Key entity types in the 54-class taxonomy:
| Category | PII Types |
|---|---|
| Names | FIRSTNAME, LASTNAME, PREFIX, MIDDLENAME |
| Contact | PHONENUMBER, EMAIL, URL |
| Address | STREETADDRESS, CITY, STATE, ZIPCODE, COUNTY, BUILDINGNUMBER |
| Financial | CREDITCARDNUMBER, CREDITCARDCVV, IBAN, BITCOINADDRESS, CURRENCYCODE, CURRENCYNAME, CURRENCYSYMBOL, AMOUNT |
| Dates | DATE, DATEOFBIRTH, TIME |
| Identity | SSN, DRIVERSLICENSE, PASSPORT, GENDER, SEX, NEARBYGPSCOORDINATE |
| Digital | IPV4, IPV6, USERAGENT, MAC, IMEI, USERNAME, PASSWORD |
| Organization | COMPANYNAME |
| Vehicle | VEHICLEVRM, VEHICLEVIN |
Format: Token classification — each token in natural language text labeled with its PII type.
Why it's valuable: The primary open dataset for phone number and email address entity annotations. Also covers person names, physical addresses (with granular subtypes), organizations, and dates. The legal domain subset is directly applicable to legal document extraction evaluation.
Links:
| Attribute | Detail |
|---|---|
| Source | Patronus AI / Databricks |
| Size | 3,000 annotated examples |
| Document types | Meeting notes, commercial contracts, marketing emails, performance reviews |
| License | Available via MosaicML/Databricks llm-foundry |
Entity types: Revenue figures, customer accounts, salary details, contact information (emails, phone numbers), company-specific data, customer-specific information.
Why it's valuable: Enterprise-focused PII in realistic business document types (contracts, meeting notes, emails). The document types commonly appear in legal discovery and evidence collections.
Links:
| Attribute | Detail |
|---|---|
| Source | Academic (NAACL 2025, Student Research Workshop) |
| Entity types | name, email, username, phone number, URL, address, identifier |
| Approach | Synthetic generation with controlled entity distribution |
Why it's valuable: Synthetic but well-labeled. Controlled generation means balanced entity type distribution (unlike real datasets where some types are rare).
Links:
| Attribute | Detail |
|---|---|
| Source | Academic research (Mendeley Data) |
| Document types | Financial documents |
| PII types | Names, SSNs, credit card numbers, phone numbers, email addresses, physical addresses, company names |
| License | Open (Mendeley) |
Why it's valuable: Financial documents share structural patterns with legal financial evidence. Fully synthetic (no real data), making it safe for CI/CD pipelines.
Links:
| Attribute | Detail |
|---|---|
| Source | Pixie Labs / open-source |
| Size | Protocol traces with 60+ PII types |
| Format | Token-wise labeled samples from JSON, SQL, HTML, XML |
| License | Open, on HuggingFace |
Why it's valuable: PII labeled in structured data formats (JSON, SQL, XML), not just natural language. Useful if your extraction pipeline processes structured legal documents or database exports.
Not a dataset but a family of NER models fine-tuned to recognize 60+ categories of PII/PHI. Standard PII (names, emails, phone numbers) plus domain-specific identifiers (medical record numbers, insurance IDs, etc.).
Useful as a baseline model to benchmark LLM-based extraction against. If a lightweight NER model achieves 90% recall on phone numbers, your LLM prompts should match or exceed that.
| Attribute | Detail |
|---|---|
| Source | Stanford HAI + 40 contributors (open science project) |
| Published | NeurIPS 2023 |
| Size | 162 tasks |
| License | Open |
| Access | `load_dataset("nguha/legalbench")` — HuggingFace |
Task types: Binary classification, multi-class classification, extraction, generation, entailment.
Text types: Statutes, judicial opinions, contracts, regulations.
Legal areas covered: Evidence, contracts, civil procedure, criminal law, corporate law, constitutional law, international law, and more.
Construction: Tasks designed and hand-crafted by legal professionals. Three sources: (1) existing legal NLP datasets, (2) novel tasks from legal experts, (3) existing legal education materials.
Why it's valuable: The broadest legal reasoning benchmark available. The extraction tasks provide ground truth for fact extraction evaluation. The classification tasks provide ground truth for document classification. 162 tasks means extensive coverage of legal reasoning scenarios across multiple areas of law.
Links:
| Attribute | Detail |
|---|---|
| Source | Academic (arXiv, March 2026) |
| Size | Large-scale corpus from US caselaw |
| Task | Legal argument mining (extracting structured arguments from case text) |
| Jurisdiction | US |
Why it's valuable: Argument mining extracts claims and supporting evidence from legal text — structurally similar to fact extraction (assertion + source evidence). This is brand new (March 2026) and represents the latest in legal information extraction research.
Links:
| Attribute | Detail |
|---|---|
| Source | Academic research using NyOn (Nyaya Ontology) |
| Tasks | Entity extraction, relation extraction, triple construction |
| Format | RDF triples (subject, predicate, object) |
| License | CC BY 4.0 |
Why it's valuable: One of the few legal datasets with labeled entity-to-entity relationships. The triple format (entity, relationship_type, entity) directly models the kind of relationships legal extraction systems need to identify.
| Attribute | Detail |
|---|---|
| Source | Academic (arXiv, March 2026) |
| Task | Benchmarking retrieval-augmented generation on legal documents |
Links:
| Attribute | Detail |
|---|---|
| Source | University of Copenhagen |
| Size | 7 sub-datasets |
| License | Open |
| Access | `load_dataset("coastalcph/lex_glue")` — HuggingFace |
Sub-datasets:
- 2 single-label text classification datasets
- 4 multi-label text classification datasets
- 1 multiple-choice question answering dataset
Why it's valuable: The multi-label classification tasks are directly applicable to evaluating legal document classification systems. Documents assigned to multiple categories (similar to multi-tag classification in evidence review).
Links:
Mentioned in community discussions as a key legal AI evaluation benchmark. Models are ranked on legal reasoning tasks using US case law. Referenced by ValsAI in evaluations of Claude, GPT, Gemini, and other models.
| Attribute | Detail |
|---|---|
| Source | Academic (arXiv, 2025) |
| Task | Evaluating quality of LLM-generated legal text |
| Dimensions | Relevance, accuracy, structure, expression |
Why it's valuable: A rare benchmark focused on quality of legal text generation, not just correctness. Useful for evaluating summary generation quality.
Links:
| Attribute | Detail |
|---|---|
| Source | Stanford RegLab + collaborators |
| Size | 689GB corpus, 24 languages, 17 jurisdictions |
| Purpose | Pre-training corpus (not labeled for NER/extraction) |
Not directly useful for evaluation (no entity labels), but demonstrates the scale of available legal text. Useful for future fine-tuning or RAG augmentation.
Links:
| Attribute | Detail |
|---|---|
| Source | Unstructured.io |
| Content | Real-world documents with expert annotations |
| Annotation quality | Manually annotated by domain experts (not algorithmically generated from metadata) |
Why it's valuable: Measures actual document parsing/extraction quality against ground truth. Every document has expert annotations for text regions, tables, headers, and structure.
Links:
| Attribute | Detail |
|---|---|
| Source | OpenDataLab |
| Published | Accepted at CVPR 2025 |
| Content | Comprehensive document parsing evaluation |
Provides benchmarks across multiple OCR models and document types.
Links:
dvgodoy/CUAD_v1_Contract_Understanding_PDF on HuggingFace — the same 510 CUAD contracts but as original PDFs rather than pre-extracted text. Enables end-to-end evaluation: PDF to OCR/extraction to entity extraction, compared against CUAD's gold-standard annotations.
Medical records frequently appear in legal contexts (personal injury, custody disputes, malpractice). Clinical NER datasets contain entity annotations in document types that overlap with legal evidence.
Key challenges and their entity types:
| Challenge | Year | Entity Types | Task |
|---|---|---|---|
| i2b2/VA | 2010 | Medical problems, tests, treatments | Concept extraction + assertion classification + relation classification |
| i2b2 Temporal | 2012 | Clinical concepts (problems, tests, treatments, departments) + temporal expressions (dates, times, durations, frequencies) + events (admissions, transfers) | Temporal relation extraction |
| n2c2 ADE | 2018 | Drug, strength, dosage, duration, frequency, form, route, reason, adverse drug events | Adverse drug event extraction |
Deidentification annotations: The original clinical notes were deidentified by replacing PII with realistic surrogates. The deidentification process itself annotated: patient names, dates, phone numbers, addresses, ages, medical record numbers, hospital names. These annotations mark the positions of PII entities even though the values are replaced.
Why it's valuable: The 2012 temporal challenge provides expert-annotated temporal expressions (dates, times, durations, frequencies) in clinical text — directly relevant to date extraction in medical records. The deidentification annotations provide labeled phone number and address positions.
Links:
| Attribute | Detail |
|---|---|
| Source | Academic (arXiv, 2022) |
| Task | Entity and attribute recognition in clinical concepts |
Links:
A cross-reference showing which datasets cover which common legal document entity types:
| Entity Type | CUAD | OntoNotes | ai4privacy | InLegalNER | Few-NERD | E-NER | AsyLex | SEC Lit KG | n2c2 | EntPII | Sources |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Person | Parties | PERSON | FIRSTNAME, LASTNAME | JUDGE, LAWYER, PETITIONER, RESPONDENT, WITNESS, OTHER_PERSON | 8 person subtypes | person | Yes (20 types) | Violator | Deidentified | Yes | 10+ |
| Organization | Parties | ORG | COMPANYNAME | ORG, COURT | 10 org subtypes | organisation, government, court, business | Yes | — | Hospital names | Yes | 8+ |
| Location | — | GPE, LOC, FAC | STREETADDRESS, CITY, STATE, ZIPCODE | GPE | 7 location subtypes | location | Yes | — | Addresses | — | 6+ |
| Date | Effective Date, Expiration Date | DATE, TIME | DATE, DATEOFBIRTH | DATE | — | — | Yes | Date | Temporal expressions | — | 6+ |
| Monetary Amount | Liability caps, fees | MONEY | AMOUNT | — | — | — | — | Fine | — | Revenue, salary | 4 |
| Phone Number | — | — | PHONENUMBER | — | — | — | — | — | Deidentification markers | Contact info | 3 |
| Email Address | — | — | EMAIL | — | — | — | — | — | — | Emails in contracts | 2 |
| Document Reference | Referenced agreements, IP | LAW, WORK_OF_ART | — | STATUTE, PROVISION | — | legislation/act | — | — | — | — | 4 |
| Event | — | EVENT | — | — | 6 event subtypes | — | — | — | Clinical events | — | 3 |
| Case Reference | — | — | — | CASE_NUMBER, PRECEDENT | — | — | — | — | — | — | 1 |
All 10 entity types are covered by at least 1 labeled dataset. 8 of 10 are covered by 3+ datasets.
| Attribute | Detail |
|---|---|
| Type | Open-source CLI + library for LLM evaluation and red-teaming |
| Language | JavaScript/TypeScript (CLI), language-agnostic via YAML configs |
| Install | `npm install -g promptfoo` |
| GitHub stars | 15k+ |
Core capabilities:
- Declarative YAML configs defining prompts, providers, test cases, and assertions
- Assertion types:
  - Deterministic: `is-json`, `contains`, `not-contains`, `equals`, `regex`
  - Programmatic: `javascript` (custom logic), `python` (custom logic)
  - Budget: `cost` (threshold per test case)
  - LLM-judged: `llm-rubric` (natural language scoring rubric evaluated by another LLM)
  - Model comparison: `similar` (semantic similarity between outputs)
- Model comparison matrix: Run the same test cases across multiple prompts/models, get side-by-side comparison
- CI/CD integration: GitHub Actions, JSON/HTML/XML output, before/after comparison on PRs
- Red-teaming: Built-in vulnerability scanning for prompt injection, jailbreaks, PII leakage
- Platform integrations: Langfuse (via `langfuse://` prefix), LangSmith, custom providers
Links:
| Attribute | Detail |
|---|---|
| Type | Built-in evaluation module of the Langfuse LLM platform |
| License | Open-source (MIT) |
Core capabilities:
- Datasets: Create collections of input/expected_output pairs as test cases, version them
- Experiments: Run LLM applications against datasets, compare results across runs
- LLM-as-a-Judge: Configure evaluator prompts in UI or SDK that automatically score traces. Supports managed evaluators and custom evaluator templates.
- A/B testing: Label-based prompt variant assignment with metric tracking per variant
- Human scoring: Manual labeling interface for domain experts to score outputs
- Pytest integration: `run_experiment()` in test suites with assertions on aggregate scores
- Score types: NUMERIC (0.0-1.0), CATEGORICAL (labels), BOOLEAN (pass/fail)
- Automated evaluator triggers: Evaluators can auto-run on new experiment runs
Links:
These two tools integrate natively:
- Prompt fetching: Promptfoo uses `langfuse://prompt-name@label` syntax to pull prompts from Langfuse
- Variable mapping: Promptfoo test case `vars` automatically map to Langfuse `{{ variable }}` placeholders
- Version/label selection:
  - `langfuse://my-prompt` — latest production version
  - `langfuse://my-prompt:3` — specific version (numeric = version)
  - `langfuse://my-prompt:staging` — specific label (string = label)
- Environment variables: `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_BASE_URL`
Links:
| Tool | Type | Best For |
|---|---|---|
| DeepEval | Python evaluation framework | Hallucination detection, faithfulness metrics |
| RAGAS | RAG evaluation framework | Retrieval quality, answer relevance |
| Evidently AI | ML/LLM monitoring | Production drift detection, data quality |
| Presidio (Microsoft) | PII detection SDK | Baseline comparison for PII entity extraction |
| autoevals (Braintrust) | LLM evaluation library | Pre-built evaluators (factuality, coherence, etc.) |
| Metric | Definition | Use Case |
|---|---|---|
| Exact match precision | Fraction of predicted entities that exactly match a ground truth entity (same text span + same type) | Strict evaluation — penalizes boundary errors |
| Exact match recall | Fraction of ground truth entities found by the model (exact text + type match) | Strict evaluation — measures completeness |
| Exact match F1 | Harmonic mean of precision and recall | Primary quality score |
| Partial match precision | Fraction of predicted entities that overlap with any ground truth entity | Lenient evaluation — tolerates boundary differences |
| Partial match recall | Fraction of ground truth entities partially found | Lenient evaluation |
| Type accuracy | Among matched entities, fraction with correct type assignment | Entity typing quality (independent of boundary accuracy) |
| Mention-level validity | Whether `mention_text` appears verbatim in source document | Hallucination detection for LLM-based extraction |
Partial matching is important for LLM-based extraction. LLMs often extract slightly different spans than human annotators (e.g., "Dr. Sarah Chen" vs "Sarah Chen"). Partial match metrics accommodate this without penalizing useful extractions.
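A minimal sketch of exact- versus partial-match scoring over character spans (the `Entity` representation and the same-type overlap rule are illustrative choices, not from any particular library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    start: int   # character offset in source text
    end: int
    type: str

def overlaps(a: Entity, b: Entity) -> bool:
    # Partial match: any character overlap, same entity type.
    return a.type == b.type and a.start < b.end and b.start < a.end

def span_metrics(predicted: list[Entity], gold: list[Entity]) -> dict:
    exact_hits = len(set(predicted) & set(gold))
    partial_hits = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    precision = exact_hits / len(predicted) if predicted else 0.0
    recall = exact_hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "exact_precision": precision,
        "exact_recall": recall,
        "exact_f1": f1,
        "partial_precision": partial_hits / len(predicted) if predicted else 0.0,
    }

# "Dr. Sarah Chen" (gold) vs "Sarah Chen" (predicted): no exact match, one partial match.
gold = [Entity(0, 14, "PERSON")]
pred = [Entity(4, 14, "PERSON")]
print(span_metrics(pred, gold))
```

On this example exact F1 is 0.0 while partial precision is 1.0, which is exactly the gap the lenient metrics are meant to close.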
| Metric | Definition |
|---|---|
| Triple exact match precision | Fraction of predicted (source, type, target) triples that exactly match ground truth |
| Triple exact match recall | Fraction of ground truth triples that were predicted |
| Type-only evaluation | Correct relationship type, relaxing entity matching to partial overlap |
| Argument role evaluation | Correct identification of source vs target roles, regardless of exact span matching |
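Exact triple matching reduces to set intersection; a minimal sketch (the triple values are illustrative):

```python
def triple_scores(predicted: set[tuple], gold: set[tuple]) -> tuple[float, float]:
    """Exact-match precision/recall over (source, relation_type, target) triples."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

gold = {("Chen", "employed_by", "Memorial Hospital"),
        ("Chen", "treated", "Patient A")}
pred = {("Chen", "employed_by", "Memorial Hospital"),
        ("Chen", "supervised", "Dr. Lee")}
print(triple_scores(pred, gold))  # -> (0.5, 0.5)
```

Type-only and argument-role variants relax the equality test on the entity fields while keeping the same counting scheme.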
| Metric | Definition | Evaluation Method |
|---|---|---|
| Assertion accuracy | Is the stated fact supported by the source document? | LLM-as-judge or human review |
| Source snippet validity | Does the attributed snippet appear verbatim in the source? | Deterministic string matching |
| Confidence calibration | Do high-confidence facts tend to be correct? Low-confidence uncertain? | Statistical calibration analysis |
| Fact completeness | What fraction of important facts from the document were captured? | LLM-as-judge against reference summary |
| Fact novelty | Are extracted facts non-trivial (not just restating the obvious)? | LLM-as-judge |
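Source snippet validity is the one fully deterministic check in this table. A sketch with light whitespace normalization (the normalization tolerance is an assumption, since OCR and text extraction often mangle spacing):

```python
import re

def snippet_is_verbatim(snippet: str, source: str) -> bool:
    """True if the attributed snippet appears verbatim in the source,
    after collapsing runs of whitespace (an assumed tolerance)."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip()
    return norm(snippet) in norm(source)

source = "The settlement of $15,000 was\napproved on January 3, 2026."
print(snippet_is_verbatim("settlement of $15,000 was approved", source))  # True
print(snippet_is_verbatim("settlement of $20,000", source))               # False
```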
| Metric | Definition | Formula |
|---|---|---|
| Tag precision | Fraction of assigned tags that are appropriate | TP / (TP + FP) per tag, macro-averaged |
| Tag recall | Fraction of appropriate tags that were assigned | TP / (TP + FN) per tag, macro-averaged |
| Hamming loss | Fraction of labels incorrectly predicted (for multi-label) | XOR(predicted, truth) / num_labels |
| Subset accuracy | Exact match of the full tag set | Full set match or not (strict) |
| Macro F1 | Average F1 across all tag types | Treats all tags equally regardless of frequency |
| Micro F1 | F1 computed globally across all tags | Weights by tag frequency |
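A pure-Python sketch of two of these metrics, Hamming loss and subset accuracy, over tag sets (the tag vocabulary is illustrative):

```python
def hamming_loss(pred: set[str], truth: set[str], all_tags: set[str]) -> float:
    # Symmetric difference = labels wrong in either direction (the XOR).
    return len(pred ^ truth) / len(all_tags)

def subset_accuracy(pred: set[str], truth: set[str]) -> bool:
    # Strict: the full tag set must match exactly.
    return pred == truth

ALL_TAGS = {"custody_relevant", "financial", "medical", "communication"}
pred = {"custody_relevant", "financial"}
truth = {"custody_relevant", "medical"}
print(hamming_loss(pred, truth, ALL_TAGS))  # 0.5 (2 of 4 labels wrong)
print(subset_accuracy(pred, truth))         # False
```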
| Metric | Definition | Evaluation Method |
|---|---|---|
| Faithfulness | Summary contains only information present in source | LLM-as-judge |
| Coverage | Summary captures key facts, parties, and relevance | LLM-as-judge against reference |
| Length compliance | Summary within specified character/word limit | Deterministic |
| Objectivity | Summary avoids opinions and legal conclusions | LLM-as-judge |
| Coherence | Summary reads naturally and is well-organized | LLM-as-judge |
| ROUGE-L | Longest common subsequence overlap with reference summary | Deterministic (requires reference) |
| BERTScore | Semantic similarity to reference summary | Model-based (requires reference) |
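ROUGE-L is simple enough to compute directly; a minimal word-level sketch using the classic LCS dynamic program (tokenization here is naive whitespace splitting; production evaluation would more likely use a maintained package such as `rouge-score`):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

print(rouge_l_f1("the settlement was approved",
                 "the settlement was approved in january"))  # -> 0.8
```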
LeMAJ (Legal LLM-as-a-Judge) is a 2025 framework published at the ACL NLLP Workshop, designed specifically for automated evaluation of legal AI systems. It outperforms both DeepEval and non-LLM evaluation methods on proprietary legal datasets and on LegalBench. The approach:
- Segment the LLM's output into individual Legal Data Points (LDPs) — atomic, independently verifiable pieces of information
- Evaluate each LDP independently against the source legal document
- Score each LDP on multiple quality dimensions
- Aggregate LDP scores into overall precision, recall, and F1 metrics
An LDP is the smallest meaningful unit of information that can be verified against a source document. Examples across extraction tasks:
| LDP Type | Example | What Makes It Verifiable |
|---|---|---|
| Entity mention | "Dr. Sarah Chen" labeled as person | Does this name appear in the source? Is the type correct? |
| Entity attribute | Canonical name "Sarah Chen" for mention "Dr. S. Chen" | Is the canonical form reasonable? |
| Relationship | "Chen" employed_by "Memorial Hospital" | Do both entities exist? Is the relationship stated or clearly implied? |
| Fact assertion | "Patient prescribed 20mg Lisinopril on January 3" | Is this factually supported by the source text? |
| Classification tag | Document tagged as custody_relevant | Is this tag appropriate given the document's content? |
| Temporal expression | "January 3, 2026" extracted as a date entity | Is the date correctly parsed and present in source? |
| Monetary value | "$15,000 settlement" | Is the amount correct? Is the context (settlement) accurate? |
| Dimension | Score Range | Description |
|---|---|---|
| Presence | 0 or 1 | Is this LDP grounded in the source document? |
| Accuracy | 0.0-1.0 | Is the extracted value correct and faithful? |
| Completeness | 0.0-1.0 | For multi-attribute LDPs, are all attributes correct? |
| Relevance | 0.0-1.0 | Is this LDP legally meaningful (not noise)? |
- `LDP_precision = correct_predicted_LDPs / total_predicted_LDPs`
- `LDP_recall = correct_predicted_LDPs / total_ground_truth_LDPs`
- `LDP_F1 = 2 * (precision * recall) / (precision + recall)`
This maps directly to standard information extraction metrics but operates at a more granular level than whole-entity or whole-document evaluation.
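The aggregation itself is a few lines; a sketch assuming each predicted LDP has already been judged correct or incorrect (the list-of-booleans representation is an assumption for illustration, not LeMAJ's actual API):

```python
def ldp_metrics(predicted_judgments: list[bool], n_gold_ldps: int) -> dict:
    """predicted_judgments: one bool per predicted LDP (True = judged correct).
    n_gold_ldps: number of LDPs in the ground-truth annotation."""
    correct = sum(predicted_judgments)
    precision = correct / len(predicted_judgments) if predicted_judgments else 0.0
    recall = correct / n_gold_ldps if n_gold_ldps else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"ldp_precision": precision, "ldp_recall": recall, "ldp_f1": f1}

# 8 predicted LDPs, 6 judged correct, against 10 ground-truth LDPs.
print(ldp_metrics([True] * 6 + [False] * 2, 10))
```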
- Strong LLM judges (GPT-4 class and above) achieve 80-90% agreement with human evaluators on legal quality dimensions
- This agreement rate is comparable to inter-annotator agreement between humans on the same tasks
- LeMAJ outperforms both automated metrics (ROUGE, BERTScore) and simpler LLM evaluation approaches (single-score rubrics)
- The segmentation step (breaking output into LDPs) is crucial — whole-output evaluation misses individual errors
| Attribute | Detail |
|---|---|
| Type | Open-source LLM engineering platform |
| Features | Prompt management, versioning, tracing, evaluation, datasets, experiments, A/B testing |
| Self-hosting | Docker Compose or Kubernetes |
| Infrastructure | Requires Postgres + ClickHouse + Redis + S3-compatible storage |
| Caching | Client-side SDK cache, 60s TTL default, zero-latency background refresh |
| Multi-tenancy | Label-based prompt variant management |
| GitHub stars | 73k+ |
Why it leads the market (as of March 2026): Cited as standing "heads and shoulders above everyone else in the prompt management space" (Maxim AI, October 2025). Combines prompt management, tracing, and evaluation in one platform. Self-hostable for data-sensitive industries.
Links:
| Attribute | Detail |
|---|---|
| Type | Open-source LLMOps platform |
| Versioning | Git-like (branches/variants with commit history) |
| Features | Prompt playground, evaluation, observability |
| License | MIT |
Stronger on experimentation and playground workflows; weaker than Langfuse on production multi-tenancy.
SaaS-only prompt management. Git-inspired version control. Good for domain expert editing. No self-hosting option (dealbreaker for data-sensitive applications).
Prompt evaluation and scoring platform. Notable for the autoevals library with pre-built evaluator functions (factuality, coherence, relevance).
Comprehensive experimentation, evaluation, and observability. Enterprise-focused.
Community research conducted March 2026 across Reddit (r/PromptEngineering, r/AI_Agents — 5 threads, 88 comments), X/Twitter (25 posts, 5,484 likes), and 40+ web pages including Langfuse docs, Promptfoo docs, Anthropic best practices, and academic papers.
- Separate prompts from application code. Prompts should be modifiable without code deploys. This is the single most impactful architectural change for LLM-powered applications.
- Treat prompts as versioned artifacts. Every prompt change creates a new immutable version. Old versions are preserved for rollback and comparison. Labels (production, staging, candidate) control which version is active.
- Cache prompts client-side. SDK-level caching with configurable TTL (typically 60 seconds) prevents prompt fetching from adding latency to LLM calls. Background refresh serves stale prompts while updating, achieving zero-latency operation.
- Use Jinja2 for template variables. `{{ variable }}` syntax with `SandboxedEnvironment` for security. Preferred over Python `.format()` for prompt management platforms.
- XML tags for prompt structuring. Anthropic specifically recommends wrapping different content types in XML tags (`<instructions>`, `<context>`, `<input>`) to reduce misinterpretation. Advanced: "salted tags" (a session-specific suffix to prevent tag spoofing).
- Structured output via tool_use is preferred. Pydantic models as tool schemas produce more reliable structured output than asking for JSON in prompt text.
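As a sketch of what this looks like in practice, here is a hand-written tool definition in the shape the Anthropic Messages API expects for `tool_use`. The tool name, entity types, and field names are illustrative; in practice the `input_schema` would typically be generated from a Pydantic model rather than written by hand:

```python
# Hypothetical extraction tool definition. Passing this as a tool forces the
# model to emit arguments matching the JSON Schema instead of free-form JSON
# embedded in prose.
extract_entities_tool = {
    "name": "record_entities",  # illustrative tool name
    "description": "Record every PII entity found in the document.",
    "input_schema": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "mention_text": {"type": "string"},
                        "entity_type": {
                            "type": "string",
                            "enum": ["PERSON", "EMAIL", "PHONENUMBER"],
                        },
                    },
                    "required": ["mention_text", "entity_type"],
                },
            }
        },
        "required": ["entities"],
    },
}
```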
- Prompt evaluation should run in CI/CD. Every PR that modifies prompts or extraction code should trigger automated evaluation. Use advisory mode initially (failures don't block merges), switch to blocking as the eval suite matures.
- Combine deterministic and LLM-judged assertions. Deterministic checks (valid JSON, required fields, cost thresholds) catch structural issues. LLM-as-judge catches semantic issues (completeness, accuracy). Both are needed.
- LLM-as-judge achieves 80-90% agreement with humans. Strong LLM judges are viable for automated evaluation, with agreement rates comparable to inter-annotator agreement between humans.
- Golden datasets from user feedback are the highest-value evaluation asset. Expert review decisions (accept/reject/edit) on real extractions build ground truth that reflects actual domain usage. This is more valuable than any public dataset because it matches your exact document types and quality expectations.
- A/B testing requires deterministic assignment. Hash-based assignment (e.g., hash of document_id + experiment_name) ensures the same document always gets the same variant on re-processing. Assignment must be persistent for meaningful comparison.
- Red-team your prompts. Test for prompt injection (user instructions that override system behavior), template injection (Jinja2 syntax in user input), XML tag escape (closing tags in user content), and system prompt leakage.
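These attack classes can be partially screened with deterministic checks before user content ever reaches a template; a sketch (the pattern list is illustrative and deliberately incomplete; real red-teaming, e.g. promptfoo's scanner, goes well beyond pattern matching):

```python
import re

# Illustrative screens for the attack classes named above.
SUSPICIOUS_PATTERNS = {
    "template_injection": re.compile(r"\{\{.*?\}\}|\{%.*?%\}", re.S),  # Jinja2 syntax
    "xml_tag_escape": re.compile(r"</(instructions|context|input)>", re.I),
    "prompt_injection": re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
}

def screen_user_input(text: str) -> list[str]:
    """Return the names of all suspicious patterns found in user-supplied text."""
    return [name for name, pat in SUSPICIOUS_PATTERNS.items() if pat.search(text)]

print(screen_user_input("Summarize. Ignore previous instructions and {{ secret }}"))
# -> ['template_injection', 'prompt_injection']
```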
- Track drift over time. Model updates can silently degrade extraction quality. Monitor entity counts, classification distributions, and quality scores on a rolling basis. Alert when metrics deviate beyond 2 standard deviations from baseline.
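The two-standard-deviation rule is a plain z-score check; a sketch using only the standard library (the monitored metric and window size are illustrative):

```python
import statistics

def drifted(history: list[float], current: float, threshold: float = 2.0) -> bool:
    """Alert when the current value deviates more than `threshold` standard
    deviations from the rolling baseline."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold

# Rolling baseline of entities-per-document over recent runs:
baseline = [41.0, 43.5, 42.0, 44.0, 42.5, 43.0, 41.5]
print(drifted(baseline, 42.8))  # within 2 stdev -> False
print(drifted(baseline, 25.0))  # sudden drop   -> True
```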
- System prompt structure: Role (1 line), Goal (what "done" looks like), Constraints (list), Uncertainty handling ("If unsure: say so explicitly and ask 1 clarifying question").
- Use 3-5 examples in `<example>` tags. Format beats adjectives. Showing the desired output format is more effective than describing it.
- Force structure in output. JSON, bullets, or rubric — explicit output constraints produce more consistent results than open-ended generation.
| Paper | Year | Contribution |
|---|---|---|
| "CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review" | 2021 (NeurIPS) | Gold-standard contract clause annotations |
| "LexGLUE: A Benchmark Dataset for Legal Language Understanding in English" | 2022 | Multi-task legal NLU benchmark |
| "The Cambridge Law Corpus: A Dataset for Legal AI Research" | 2023 (NeurIPS) | 250k UK court cases with case outcome annotations |
| "LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw" | 2026 | US caselaw argument mining |
| "Legal-DC: Benchmarking RAG for Legal Documents" | 2026 | Legal RAG evaluation |