AIResearch

AI-Powered Web Scraping Revolutionizes Bibliographic Data Extraction

AI Research
March 23, 2026
3 min read

In a groundbreaking development that could transform how researchers and librarians access vast repositories of information, computer scientists from Universidad Complutense de Madrid have demonstrated how advanced prompt engineering with large language models can generate fully functional web-scraping programs in a single interaction. The research, led by Manuel Blázquez Ochando, Juan José Prieto Gutiérrez, and María Antonia Ovalle Perandones, shows that carefully structured prompts can produce production-ready PHP code for extracting bibliographic data from massive catalogues like Spain's National Library, which contains over 17 million records. This approach eliminates the need for multiple programming iterations and makes sophisticated data extraction tools accessible to researchers without deep technical expertise, potentially democratizing access to bibliographic research on an unprecedented scale.

At the heart of this breakthrough is a meticulously designed prompt structure that combines Role Prompting and Few-shot Prompting techniques. The researchers discovered that simple, direct queries to AI models like ChatGPT-4o typically produce inadequate code—in their control experiment, a basic prompt generated a web-scraper that incorrectly assumed Dublin Core metadata structure and used inappropriate libraries. In contrast, their advanced prompt structure includes five critical sections: role definition, context and purpose, inputs and constraints, input-output examples, and detailed steps. This comprehensive approach guides the AI to produce code that correctly uses cURL functions for HTTP connections, XPath for data extraction, and proper error handling, resulting in functional code from the first interaction.
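The five-section prompt structure can be sketched as a simple template assembly. The section names (role, context and purpose, inputs and constraints, input-output examples, detailed steps) follow the article; the function name and placeholder contents below are illustrative assumptions, not the authors' actual prompt.

```python
# Illustrative sketch of the five-section prompt structure described above.
# Section names follow the paper; the contents are hypothetical placeholders.

def build_prompt(role, context, constraints, examples, steps):
    """Assemble the five sections into a single structured prompt."""
    sections = [
        f"## Role\n{role}",
        f"## Context and purpose\n{context}",
        "## Inputs and constraints\n" + "\n".join(f"- {c}" for c in constraints),
        "## Input-output examples\n" + "\n".join(
            f"Input: {i}\nOutput: {o}" for i, o in examples),
        "## Detailed steps\n" + "\n".join(
            f"{n}. {s}" for n, s in enumerate(steps, 1)),
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    role="You are an expert PHP developer specializing in web scraping.",
    context="Generate a scraper for a national library catalogue.",
    constraints=["Use cURL for HTTP connections", "Use XPath for extraction"],
    examples=[("<td>Title: Don Quijote</td>", '{"title": "Don Quijote"}')],
    steps=["Fetch the record page", "Extract fields with XPath",
           "Handle errors and missing fields"],
)
```

Bundling all five sections into one message is what lets the model return working code in a single interaction, rather than converging on it over several rounds of correction.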

The results demonstrate remarkable effectiveness. Using their advanced prompt with ChatGPT-4o, the researchers generated a PHP web-scraper that successfully processed 55,473 records from the National Library of Spain's catalogue with an average completion rate of 85.43% per record. The system achieved an extraction rate of 66.76 records per minute, processing the entire dataset in approximately 13.85 hours of effective scraping time. Even more impressively, the approach proved interoperable across different AI models—when the researchers used the same prompt structure with Claude Sonnet 3.5 to add database integration and looping functionality, it successfully modified the original ChatGPT-generated code to handle 50,000 records with proper MySQL integration and server-respecting pauses.
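The reported throughput and total runtime are mutually consistent, as a quick check shows:

```python
# Sanity-check the reported figures: 55,473 records at 66.76 records/minute.
records = 55_473
rate_per_min = 66.76

total_minutes = records / rate_per_min   # ~830.9 minutes of effective scraping
total_hours = total_minutes / 60
print(round(total_hours, 2))             # ~13.85 hours, matching the paper
```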

This research has profound implications for bibliometric research and library operations. Traditionally, researchers faced significant barriers in collecting large bibliographic datasets, often requiring weeks of software development or expensive commercial tools. The new approach allows researchers without programming expertise to develop custom data mining tools for specific research questions—for instance, studying the evolution of artificial intelligence publications across national catalogues. For libraries, it simplifies data migration between systems and enables innovative services like automated bibliographic alerts that monitor new acquisitions across multiple catalogues. The approach also provides a viable alternative when specialized APIs or protocols like OAI-PMH are unavailable, particularly benefiting smaller institutions with limited technical resources.

Despite its impressive results, the research acknowledges several limitations. The approach has been primarily validated with Spain's National Library catalogue, though the principles are theoretically applicable to other bibliographic systems. The generated code is limited to PHP and specific technology stacks (Apache, MySQL), and the results are tied to particular versions of AI models (ChatGPT-4o and Claude Sonnet 3.5). Performance is also constrained by necessary server-respecting measures—the 3-second pauses between requests, while ethically important, increased total execution time significantly. The researchers emphasize that automated bibliographic data extraction requires careful ethical consideration, including respecting institutional data usage policies and implementing appropriate rate controls to avoid server overload while maintaining transparency about extraction purposes and methodologies.
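The server-respecting pacing the authors describe amounts to a fixed pause between requests. A minimal sketch, assuming a pluggable fetch function and configurable pause (this is not the authors' PHP implementation, just the pattern in Python):

```python
import time

def scrape_politely(urls, fetch, pause_seconds=3.0):
    """Fetch each URL in turn, pausing between requests so the target
    server is not overloaded (the paper used 3-second pauses)."""
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:           # no pause needed after the last request
            time.sleep(pause_seconds)
    return results

# Usage with a stub fetcher; a real run would perform an HTTP GET instead.
pages = scrape_politely(
    ["https://example.org/record/1", "https://example.org/record/2"],
    fetch=lambda u: f"fetched {u}",
    pause_seconds=0,                    # zero pause only for this demo
)
```

The trade-off the article notes falls directly out of this loop: with a 3-second pause, each additional request adds at least 3 seconds of wall-clock time regardless of how fast the extraction itself runs.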

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn