AI Cuts Legal Data Noise by 73%

Understanding human smuggling networks is critical for security and policy, but legal documents are notoriously dense and ambiguous, making automated analysis difficult. A new AI method, CORE-KG, systematically reduces noise and duplication in knowledge graphs built from these texts, offering clearer insights into criminal operations.

Researchers found that combining coreference resolution with structured prompting significantly improves the quality of knowledge graphs derived from legal case documents. Coreference resolution unifies different mentions of the same entity—such as 'Defendant Lewis' and 'Lewis'—into a single node, while structured prompts guide AI to extract only relevant entities and relationships, filtering out irrelevant details like court procedures. This dual approach minimizes redundant and noisy data, which previously cluttered graphs and hindered analysis.
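To make the node-merging idea concrete, here is a minimal sketch of how resolved mentions collapse into one graph node. The `CANONICAL` mapping, the `merge_mentions` helper, and the example edges are hypothetical illustrations, not part of CORE-KG; in the described system the mapping would come from the language-model coreference step rather than a hand-written dictionary.

```python
# Hypothetical coreference output: each surface mention maps to one canonical name.
# (Assumption for illustration; a real pipeline would produce this with an LLM.)
CANONICAL = {
    "Defendant Lewis": "Lewis",
    "Lewis": "Lewis",
}

def merge_mentions(edges, canonical):
    """Rewrite edge endpoints to canonical names and drop duplicate edges."""
    merged = set()
    for source, relation, target in edges:
        merged.add((
            canonical.get(source, source),  # unknown mentions are left unchanged
            relation,
            canonical.get(target, target),
        ))
    return sorted(merged)

# Illustrative edges, invented for this example.
edges = [
    ("Defendant Lewis", "drove", "van"),
    ("Lewis", "drove", "van"),
    ("Lewis", "met", "passengers"),
]

merged_edges = merge_mentions(edges, CANONICAL)
nodes = {name for source, _, target in merged_edges for name in (source, target)}
```

Without the mapping, 'Defendant Lewis' and 'Lewis' would be two separate nodes carrying the same "drove" edge; after merging there is one node and one edge.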

The method processes legal texts through a modular pipeline. First, a coreference resolution module uses a large language model to consolidate entity references one type at a time (Person, Location, Route, and so on) rather than all at once. Then, a knowledge graph construction step uses domain-specific prompts to extract entities and relationships, with instructions to suppress boilerplate and focus on smuggling-related elements. Handling one entity type per pass keeps the model's attention from being spread across several types, and the prompt-level filtering keeps procedural text out of the graph.
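The two stages described above can be sketched roughly as follows. This is an assumption-laden outline, not the authors' implementation: the prompt wording, the `source | relation | target` output format, the `run_pipeline` function, and the `fake_llm` stand-in are all hypothetical, and the entity-type list is only the partial one named in the text.

```python
from typing import Callable

# Entity types named in the article; the full list is not given there.
ENTITY_TYPES = ["Person", "Location", "Route"]

# Hypothetical model interface: prompt string in, completion string out.
LLM = Callable[[str], str]

def coref_prompt(entity_type: str, text: str) -> str:
    return (
        f"Rewrite the text so every mention of the same {entity_type} "
        f"uses one consistent name. Change nothing else.\n\nTEXT:\n{text}"
    )

def extraction_prompt(text: str) -> str:
    return (
        "Extract smuggling-related entities and relationships, one per line, "
        "as: source | relation | target. Ignore court procedure and boilerplate."
        f"\n\nTEXT:\n{text}"
    )

def run_pipeline(text: str, llm: LLM) -> list[tuple[str, ...]]:
    # Stage 1: one coreference pass per entity type; each pass sees the
    # output of the previous one.
    for entity_type in ENTITY_TYPES:
        text = llm(coref_prompt(entity_type, text))
    # Stage 2: structured extraction over the resolved text.
    raw = llm(extraction_prompt(text))
    triples = []
    for line in raw.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):  # skip malformed lines
            triples.append(tuple(parts))
    return triples

def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for demonstration only."""
    text = prompt.split("TEXT:\n", 1)[1]
    if prompt.startswith("Rewrite"):
        return text.replace("Defendant Lewis", "Lewis")
    return "Lewis | drove | van\nnot a triple"

triples = run_pipeline("Defendant Lewis drove the van.", fake_llm)
```

Swapping `fake_llm` for a call to an actual model is the only change needed to try the outline on real text. The parser drops any line that is not a well-formed triple.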

Results from ablation studies on 20 legal cases show that removing structured prompts leads to a 73.33% increase in noisy nodes, while disabling coreference resolution causes a 28.32% rise in node duplication. The first figure is a relative increase: the reported noise rate is 16.65% with the full CORE-KG system and 28.86% without structured prompts, a gap of 12.21 percentage points. The graphs generated with both components are more coherent and interconnected, with a higher relationship-to-node ratio of 2.88, indicating better structural clarity for analyzing networks like smuggling routes and actors.
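The 73.33% figure can be checked against the two noise rates quoted above. The short calculation below uses only those reported numbers; the function and variable names are made up for illustration.

```python
def relative_increase(baseline: float, ablated: float) -> float:
    """Percentage change from baseline to ablated, relative to baseline."""
    return (ablated - baseline) / baseline * 100

noise_full = 16.65        # noise rate (%) with the full system, as reported
noise_no_prompts = 28.86  # noise rate (%) without structured prompts, as reported

increase = relative_increase(noise_full, noise_no_prompts)
point_gap = noise_no_prompts - noise_full

print(f"relative increase: {increase:.2f}%")
print(f"absolute gap: {point_gap:.2f} percentage points")
```

The distinction matters when reading the headline number: noise rises by roughly 73% relative to the full system, while the absolute difference is about 12 percentage points.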

This advance matters because it enables more accurate mapping of complex criminal networks from public legal records, helping law enforcement and policymakers disrupt illicit activities with less manual data cleaning. By providing cleaner graphs, the method supports faster, evidence-based decisions in security contexts.

Limitations include the reliance on existing legal documents that lack gold-standard annotations and the possibility that performance will vary on highly ambiguous or novel text patterns. Future work could explore dynamic adaptations and cross-case coreference to further improve graph accuracy.

About the Author

Guilherme A.