AI Sees Through Cat Eyes Better Than Humans Do

Self-supervised vision transformers achieve 81% alignment with feline vision, revealing how AI can bridge species perception gaps without labels.

AI Research
November 05, 2025

Imagine an AI that sees the world not just through human eyes, but through the eyes of a cat. This isn't science fiction: researchers have found that some AI models bridge the perceptual gap between species better than others, with self-supervised vision transformers leading the pack. For anyone curious about how AI might interpret the world from different perspectives, this research offers concrete evidence that machines can develop cross-species visual understanding without being explicitly trained for it.

The key finding is that self-supervised vision transformers, particularly DINO ViT-B/16, achieve the strongest alignment between human and cat visual representations. When tested on paired images (one in standard human view, the other filtered to approximate feline vision), this model showed a mean RBF-kernel Centered Kernel Alignment (CKA) of 0.814. CKA is a similarity index that runs from 0 to 1, so a score of 0.814 means the model's internal feature patterns for the two views were closely matched, though not identical. In comparison, supervised transformers like ViT-B/16 reached 0.776, while top convolutional neural networks (CNNs) such as EfficientNet-B3 peaked at 0.702. This suggests that self-supervised learning, which doesn't rely on human-labeled data, fosters features that generalize better across biological vision systems.
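To make the CKA score concrete, here is a minimal NumPy sketch of RBF-kernel CKA between two feature matrices whose rows are paired samples. It is not the study's code: the function names are illustrative, the bandwidth is set with the median heuristic (the article does not say how the bandwidth was chosen), and the simple biased HSIC estimator is assumed.

```python
import numpy as np

def rbf_gram(x, sigma=None):
    """RBF Gram matrix for the rows of x.

    If sigma is None, use the median pairwise distance as the bandwidth
    (an assumption; the article does not state the bandwidth choice).
    """
    sq_norms = np.sum(x * x, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from rounding
    if sigma is None:
        off_diag = sq_dists[~np.eye(len(x), dtype=bool)]
        sigma = np.sqrt(np.median(off_diag))
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def center_gram(k):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h

def cka_rbf(x, y, sigma=None):
    """RBF-kernel CKA between paired feature matrices x (n, d1) and y (n, d2)."""
    kx = center_gram(rbf_gram(x, sigma))
    ky = center_gram(rbf_gram(y, sigma))
    hsic_xy = np.sum(kx * ky)  # biased HSIC, up to a constant that cancels
    return hsic_xy / np.sqrt(np.sum(kx * kx) * np.sum(ky * ky))

# Synthetic stand-ins for features of human-view and cat-filtered frames.
rng = np.random.default_rng(0)
human_feats = rng.normal(size=(200, 8))
cat_feats = human_feats + 0.5 * rng.normal(size=(200, 8))
print("CKA (related views):", cka_rbf(human_feats, cat_feats))
print("CKA (unrelated):", cka_rbf(human_feats, rng.normal(size=(200, 8))))
```

Because the kernel depends only on pairwise distances, this score is unchanged by rotating the features, and with the median bandwidth it is also unchanged by uniform rescaling. That invariance is what lets layers of different width and scale be compared.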

To measure this, the team used a frozen-encoder approach, keeping 35 pre-trained models unchanged to isolate their inherent representational biases. They sourced 191 point-of-view videos from cats wearing cameras, extracting over 300,000 frame pairs. Each pair consisted of an original human-view frame and a transformed version simulating cat vision, accounting for factors like reduced color sensitivity, vertical slit pupils, and motion biases. Features were extracted layer by layer from models spanning CNNs, supervised transformers, windowed transformers (e.g., Swin), and self-supervised transformers (DINO variants). Alignment was quantified using CKA and Representational Similarity Analysis (RSA), with distribution shifts tested via Maximum Mean Discrepancy and Wasserstein distance, all under strict statistical controls.
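Two of the measures named above can be sketched in a few lines of NumPy. The block below is a simplified illustration, not the study's implementation. RSA is taken here as the Spearman correlation between two correlation-distance dissimilarity matrices, and Maximum Mean Discrepancy uses a fixed-bandwidth RBF kernel with the biased estimator. All function names, the dissimilarity measure, and the bandwidth are assumptions.

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix: 1 - Pearson r between samples (rows)."""
    return 1.0 - np.corrcoef(features)

def upper_triangle(m):
    """Entries above the diagonal, flattened."""
    i, j = np.triu_indices(m.shape[0], k=1)
    return m[i, j]

def ordinal_ranks(v):
    """Ranks 0..n-1; ties are broken by position rather than averaged."""
    return np.argsort(np.argsort(v)).astype(float)

def rsa_score(x, y):
    """Spearman correlation between the RDMs of paired feature matrices."""
    rx = ordinal_ranks(upper_triangle(rdm(x)))
    ry = ordinal_ranks(upper_triangle(rdm(y)))
    return np.corrcoef(rx, ry)[0, 1]

def mmd2_rbf(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x and y (RBF kernel)."""
    def kernel(a, b):
        sq = (np.sum(a * a, axis=1)[:, None]
              + np.sum(b * b, axis=1)[None, :]
              - 2.0 * (a @ b.T))
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

# Synthetic stand-ins for one layer's features on the two views.
rng = np.random.default_rng(1)
human_feats = rng.normal(size=(100, 16))
cat_feats = human_feats + 0.3 * rng.normal(size=(100, 16))
print("RSA:", rsa_score(human_feats, cat_feats))
print("MMD^2:", mmd2_rbf(human_feats, cat_feats, sigma=4.0))
```

The two numbers answer different questions. RSA asks whether pairs of frames that are similar in one view are also similar in the other, while MMD asks whether the two sets of features come from the same distribution at all. A model can score well on the first and still show a shift on the second.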

The results, detailed in the paper's figures and tables, show that self-supervised transformers not only lead in alignment metrics but also peak early in their processing blocks: DINO ViT-B/16's best alignment occurred at block 0, with a CKA-RBF of 0.814. Supervised transformers aligned best in deeper blocks (e.g., ViT-L/16 at block 14 with 0.806 CKA-RBF), while CNNs like EfficientNet-B3 peaked at stage 5. Despite high alignment, distribution shift tests revealed persistent differences. For instance, DINOv3 ViT-7B/16 had a projected 1-Wasserstein distance over 700 in late blocks, indicating that even well-aligned models don't fully eliminate domain gaps. Visualization through t-SNE and UMAP plots (Figures 3-5 in the paper) confirmed that self-supervised models produce more overlapping embeddings between human and cat domains, reducing species-specific clustering.
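The article does not say how the "projected" 1-Wasserstein distance was computed. One common reading is to project the high-dimensional features onto one-dimensional directions and measure the distance there. The sketch below assumes random unit directions averaged together (a sliced Wasserstein distance) and equal sample counts for the two views. The function names and the number of projections are illustrative, and the paper may use a different projection.

```python
import numpy as np

def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1-D samples.

    For equal sample sizes this is the mean absolute difference
    between the sorted values.
    """
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def projected_wasserstein(x, y, n_projections=64, seed=0):
    """Average 1-D Wasserstein distance over random unit directions.

    x and y are (n, d) feature matrices with the same n.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(x.shape[1], n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    px, py = x @ directions, y @ directions
    dists = [wasserstein_1d(px[:, k], py[:, k]) for k in range(n_projections)]
    return float(np.mean(dists))

# Synthetic stand-ins: a shifted copy should sit farther away than a
# fresh sample drawn from the same distribution.
rng = np.random.default_rng(2)
human_feats = rng.normal(size=(300, 16))
print("same distribution:", projected_wasserstein(human_feats, rng.normal(size=(300, 16))))
print("shifted features:", projected_wasserstein(human_feats, human_feats + 2.0))
```

Unlike CKA, this distance is not normalized, so it grows with the scale of the features. That is one plausible reason a large model's late blocks can report values in the hundreds while still scoring well on alignment.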

This matters because it demonstrates AI's potential to adapt to diverse sensory inputs, which could improve applications in robotics, wildlife monitoring, or assistive technologies for vision impairments. By understanding how AI aligns with biological vision, developers can design systems that interpret environments from multiple perspectives, making them more reliable when real-world conditions vary. For instance, a system built on such encoders might better navigate settings with different lighting or motion patterns, akin to how animals perceive their surroundings.

Limitations include the use of an approximate cat-vision filter rather than real biological data, and the focus on static frames over dynamic vision. The study also doesn't address how these alignments translate to task performance, leaving open questions about practical utility. Future work could extend to other species or incorporate temporal aspects to see if these findings hold in video contexts.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
