As AI agents become more capable, they increasingly rely on third-party skills—packaged capabilities that extend their functionality. These skills, available through platforms like ClawHub and skills.sh, can range from file operations to network access, but their quality and safety are often uncertain. A new tool called SkillTester, developed by researchers at Peking University and Northwestern Polytechnical University, addresses this gap by providing a systematic way to evaluate both the utility and security of agent skills. This is crucial because, as noted in the paper, a malicious skill can lead to data exfiltration or unauthorized system access, with real-world audits already uncovering hundreds of flawed or malicious skills in public repositories.
SkillTester's key contribution is that it assigns clear scores to skills based on how much value they add over a baseline and how they behave under security tests. The tool uses a comparative utility principle: rather than checking whether a skill works in isolation, it compares outcomes with and without the skill under identical conditions. For example, if a skill enables a task that fails without it, that counts as high utility. It also runs controlled security probes to detect unsafe behaviors like unauthorized file access or data leakage, grouping them into categories such as abnormal behavior control and permission boundary. This dual approach helps users make informed decisions about which skills to enable, reducing reliance on indirect signals like popularity or developer reputation.
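To make the probe grouping concrete, here is a minimal sketch of how per-category pass rates could be tallied. The probe names and the `PROBE_GROUPS` structure are illustrative assumptions, not SkillTester's actual suite; only the three category names come from the paper.

```python
# Illustrative sketch: the three probe directions named in the paper,
# populated with hypothetical probe IDs (not the tool's real probes).
PROBE_GROUPS = {
    "abnormal_behavior_control": ["unexpected_shell_exec", "runaway_loop"],
    "permission_boundary": ["read_outside_workspace", "unauthorized_network_call"],
    "sensitive_data_protection": ["env_var_leak", "credential_exfiltration"],
}

def group_pass_rates(results: dict) -> dict:
    """Per-group pass rate: fraction of probes in each group that passed."""
    rates = {}
    for group, probes in PROBE_GROUPS.items():
        passed = sum(1 for p in probes if results[p])
        rates[group] = passed / len(probes)
    return rates
```

A single failed probe in, say, the permission-boundary group would drop that group's rate to 0.5 while leaving the other two at 1.0, which is how one imperfect direction can pull the overall security score below a perfect 100.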
The methodology behind SkillTester involves two main components: paired utility evaluation and a separate security probe suite. For utility, the framework runs each task in two conditions—a baseline without the skill and a with-skill condition—using the same model and environment. Tasks are authored based on skill analysis, covering common use cases and edge cases, and each must have explicit objectives and pass criteria. The tool checks whether the skill is actually invoked and compares success rates along with efficiency metrics like token cost and elapsed time. For security, it uses a controlled probe suite with three directions—abnormal behavior control, permission boundary, and sensitive data protection—designed to mimic adversarial scenarios. These probes are structured to verify claims rather than trust them, aligning with threat models like the OWASP Agentic Top 10.
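The paired-evaluation loop can be sketched as follows. This is a hypothetical harness, not SkillTester's actual API: `run_task`, `RunResult`, and the field names are assumptions introduced here to illustrate the with/without comparison and the invocation check.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    success: bool        # did the run meet the task's pass criteria?
    skill_invoked: bool  # was the skill actually called during the run?
    tokens: int          # token cost of the run
    elapsed_s: float     # wall-clock time of the run

def paired_eval(run_task, task_id: str) -> dict:
    """Run the same task twice -- baseline (skill disabled) and with the
    skill enabled -- under the same model and environment, then compare."""
    baseline = run_task(task_id, skill_enabled=False)
    with_skill = run_task(task_id, skill_enabled=True)
    return {
        # Guard against "wins" where the skill was never actually used.
        "skill_invoked": with_skill.skill_invoked,
        # Incremental value: the skill enables a task the baseline fails.
        "incremental": with_skill.success and not baseline.success,
        "both_succeed": with_skill.success and baseline.success,
        # >1.0 means the with-skill run was cheaper in tokens.
        "token_ratio": (baseline.tokens / with_skill.tokens
                        if with_skill.tokens else None),
    }
```

Checking `skill_invoked` matters: a with-skill run that succeeds without ever calling the skill says nothing about the skill's value.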
Results from SkillTester are presented as a utility score and a security score with a three-level status label (Pass, Caution, or Risky). The utility score is computed by averaging task-level scores, where a task receives full credit (100) if the skill succeeds and the baseline fails, or scaled credit based on relative efficiency if both succeed. For instance, under default parameters, a task with equal cost between skill and baseline scores 50, while a more efficient with-skill run can score up to 100. The security score is an average of pass rates across the three probe groups, with a status label assigned by threshold; currently, scores below 80 are labeled Risky. The paper provides hypothetical examples, such as a utility score of 74.0 indicating meaningful value but not full incremental benefit, and a security score of 92.0 with a Caution status showing that most probes pass but not all.
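A minimal sketch of this scoring, under stated assumptions: the defaults η=50 and θ_s=80 come from the paper, but the exact scaled-credit formula and the Pass/Caution boundary are plausible reconstructions chosen to reproduce the described behavior (equal cost scores 50; scores below 80 are Risky), not the paper's verbatim equations.

```python
ETA = 50      # default neutral efficiency credit (paper's η)
THETA_S = 80  # default security threshold (paper's θ_s)

def task_utility(skill_ok: bool, base_ok: bool,
                 skill_cost: float, base_cost: float) -> float:
    """Task-level utility credit. The scaled-credit formula below is an
    illustrative guess: equal cost scores ETA (50), and a cheaper
    with-skill run scales toward the 100 cap."""
    if skill_ok and not base_ok:
        return 100.0  # full incremental credit: skill enables the task
    if not skill_ok:
        return 0.0
    # Both succeed: scale around the neutral point by relative cost.
    return min(100.0, ETA * base_cost / skill_cost)

def security_status(score: float) -> str:
    """Map a 0-100 security score to a status label. The Risky cutoff
    (θ_s=80) is from the paper; treating only a perfect score as Pass
    is an assumption consistent with its 92.0 → Caution example."""
    if score < THETA_S:
        return "Risky"
    return "Pass" if score == 100.0 else "Caution"
```

With these defaults, a with-skill run at half the baseline's cost scores the full 100, while one twice as expensive scores 25.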
The implications of SkillTester are significant for the growing ecosystem of AI agent skills, where users often lack reliable ways to assess quality and risk. By providing structured, comparative evidence, the tool can help prevent the adoption of malicious or ineffective skills, which the paper notes have already been found in public audits—like the Snyk report identifying 534 skills with critical security issues out of 3,984 examined. This is particularly important because skills can involve risky actions like shell execution or network access, inheriting threats such as indirect prompt injection. For everyday users and developers, SkillTester offers a practical quality-assurance harness, making skill selection more transparent and safer, potentially supporting future applications like release qualification or trust signaling in agent-first software engineering.
However, the tool has limitations outlined in the paper. Its security coverage is not exhaustive; it relies on controlled probes rather than formal verification, meaning it may not catch all possible threats. The utility evaluation depends on task authoring, which requires skill analysis and may not cover every use case. Additionally, the scoring uses default parameters (e.g., η=50 for neutral efficiency, θs=80 for security thresholds) that are intended as a starting point and may need refinement based on broader data. The framework also excludes runtime monitoring and adaptation, focusing solely on selection and enablement decisions. As the skill ecosystem evolves, these limitations highlight areas for future improvement, such as expanding probe categories or adjusting calibration against human judgment.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.