A new artificial intelligence system has demonstrated remarkable success in generating Python code from instructions written in Bangla, a language spoken by over 270 million people worldwide. This breakthrough addresses a significant gap in AI tools, which have traditionally been English-centric, limiting access for non-English speakers and hindering our understanding of how linguistic factors affect code generation. The system, developed by researchers from Johannes Gutenberg University Mainz and other institutions, won first place in the BLP-2025 Shared Task on Code Generation from Bangla Instructions, showcasing a practical solution to a long-standing imbalance in programming assistance.
The key finding of this research is that a two-agent pipeline, combining code generation with selective debugging driven by unit tests, can achieve a Pass@1 score of 95.4% on a test set of 500 Bangla instructions. This metric measures the proportion of instructions for which a single generated program passes all provided unit tests, indicating functional correctness. Without this approach, the best-performing model, GPT-5, achieved only a 64.6% Pass@1 score in the initial coding stage, highlighting the critical role of test-driven refinement in improving accuracy. The system's success underscores how structured feedback, such as error traces and additional test cases, can significantly boost AI performance in code synthesis for underserved languages.
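As a point of reference, the Pass@1 metric with one sample per instruction reduces to a simple fraction. A minimal sketch (the function name and list encoding are illustrative, not from the paper):

```python
def pass_at_1(results):
    """results: one boolean per instruction, True if the single
    generated program passed every provided unit test."""
    return sum(results) / len(results)

# 477 of 500 instructions solved on the first (and only) sample
# corresponds to the reported 95.4% Pass@1.
score = pass_at_1([True] * 477 + [False] * 23)
print(score)  # 0.954
```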
The methodology involves a multi-agent pipeline in which a code-generation agent first produces an initial Python solution from a Bangla instruction, including function and argument names. This code is then executed against pytest-style unit tests provided in the dataset. If any tests fail, only those failing cases, along with error traces and the current program, are forwarded to a debugger agent. This debugger conditions on the instruction, test suite, and error messages to generate a revised solution, minimizing unnecessary changes. The researchers used proprietary APIs from OpenAI, Google, and Anthropic, with GPT-5 selected as the primary model due to its superior performance in development tests. An external dataset from Austin et al. (2021) was also incorporated to augment test cases, covering 480 of the 500 instructions.
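The control flow described above can be sketched as follows. This is a simplified, self-contained illustration: the `generate_code` and `debug_code` callables stand in for the proprietary model APIs, the tests are plain assert statements rather than a full pytest suite, and all names are assumptions rather than the paper's actual code.

```python
import traceback

def run_tests(code, tests):
    """Execute a candidate program against assert-style unit tests.
    Returns a list of (test_source, error_trace) for failing tests."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return [("<module load>", traceback.format_exc())]
    failures = []
    for test in tests:
        try:
            exec(test, namespace)
        except Exception:
            failures.append((test, traceback.format_exc()))
    return failures

def solve(instruction, tests, generate_code, debug_code, max_rounds=1):
    """Stage 1: generate an initial solution; Stage 2: selectively
    debug, forwarding only failing cases and their error traces."""
    code = generate_code(instruction)
    for _ in range(max_rounds):
        failures = run_tests(code, tests)
        if not failures:
            return code, True
        code = debug_code(instruction, code, failures)
    return code, not run_tests(code, tests)

# Toy demonstration with stub agents: the first draft has an
# off-by-one bug; the "debugger" returns a corrected version.
buggy = "def add(a, b):\n    return a + b + 1\n"
fixed = "def add(a, b):\n    return a + b\n"
code, ok = solve(
    "দুটি সংখ্যা যোগ করুন",  # "add two numbers" in Bangla
    ["assert add(2, 3) == 5"],
    generate_code=lambda inst: buggy,
    debug_code=lambda inst, c, fails: fixed,
)
print(ok)  # True
```

The key design choice this mirrors is that the debugger is invoked only when tests fail, and sees only the failing cases, which keeps revisions targeted and minimizes unnecessary changes.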
Analysis reveals that the debugger agent dramatically improved performance across all tested models. For GPT-5, the Pass@1 score increased from 64.6% in Stage 1 to 95.4% in Stage 2, a relative improvement of 47.67%. Other models showed gains as well: GPT-4.1 increased from 58.0% to 82.6%, Claude Sonnet 4 from 58.2% to 79.0%, and Gemini-2.5-Flash from 52.6% to 59.8%. The use of external test data was crucial: without it, GPT-5's score dropped to 86.0%, indicating that more unit tests help the model generate more generalized code. However, the system showed signs of overfitting to the provided tests, with a 99.8% Pass@1 on development data but only 95.4% on hidden tests, suggesting weaknesses in handling unseen edge cases. Translation experiments, where Bangla instructions were converted to English, yielded mixed results: for GPT-5, performance decreased slightly in both stages, while other models saw minor improvements in Stage 2, indicating that translation can sometimes clarify instructions but may also lose task-specific information.
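The quoted relative gains follow directly from the Stage 1 and Stage 2 scores; a quick arithmetic check (the dictionary layout is just for illustration):

```python
# Stage 1 and Stage 2 Pass@1 scores (%) as reported above.
stage_scores = {
    "GPT-5": (64.6, 95.4),
    "GPT-4.1": (58.0, 82.6),
    "Claude Sonnet 4": (58.2, 79.0),
    "Gemini-2.5-Flash": (52.6, 59.8),
}

relative_gain = {
    model: (s2 - s1) / s1 * 100 for model, (s1, s2) in stage_scores.items()
}
# GPT-5: (95.4 - 64.6) / 64.6 * 100 ≈ 47.7%, the relative
# improvement quoted in the analysis.
print(round(relative_gain["GPT-5"], 1))  # 47.7
```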
The implications of this work are substantial for making programming tools more accessible globally. By effectively supporting Bangla, the system opens doors for developers with limited English proficiency to use AI-assisted coding, potentially boosting innovation in regions where the language is prevalent. The test-driven approach also offers a blueprint for improving code generation in other low-resource languages, as it relies on executable feedback rather than text-only metrics, ensuring functional correctness. This could lead to more equitable AI development, reducing the digital divide and fostering diverse contributions to technology. Moreover, the research highlights the importance of linguistic diversity in AI benchmarks, encouraging further exploration of how script variation and code-mixing impact software synthesis.
Limitations of the study include its exclusive focus on proprietary models, which limits reproducibility and generalizability, as open-source models were not evaluated. The reliance on an external dataset for test augmentation means that without it, performance drops by approximately 10 percentage points, indicating a dependency on additional resources. There are also substantial performance differences across proprietary models, but the lack of transparency into their training data makes it difficult to attribute these variations to specific factors. Additionally, the system may overfit to the provided unit tests, as evidenced by the gap between development and test scores, suggesting a need for better generalization techniques. These constraints point to areas for future research, such as incorporating open-source models and developing more robust testing frameworks to handle edge cases in multilingual code generation.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.