A new approach to understanding language challenges the core assumption that artificial intelligence must learn from vast amounts of data to grasp meaning. Researchers have developed a method called Harmonic Token Projection (HTP) that generates text embeddings—numerical representations of words and sentences—using only deterministic mathematical rules, without any training, vocabulary, or random parameters. This technique encodes each token as a harmonic trajectory derived from its Unicode integer representation, creating a reversible and interpretable mapping between symbols and continuous vector space. The results, detailed in a recent paper, suggest that a significant portion of semantic similarity can be captured through pure geometry, offering a transparent and efficient alternative to the opaque, data-hungry neural networks that dominate natural language processing today.
The key finding is that HTP achieves performance comparable to some trained models on standard benchmarks. On the Semantic Textual Similarity Benchmark (STS-B), HTP with stopword removal reached a Spearman correlation of ρ = 0.70 and a Pearson correlation of r = 0.71, slightly surpassing the base BERT model, which scored ρ = 0.68 and r = 0.70. This result is notable because BERT relies on extensive pretraining and over a hundred million parameters, while HTP uses no learned components. In multilingual tests across ten languages, HTP maintained an average Spearman correlation of ρ = 0.64, outperforming Word2Vec (ρ = 0.61) and roughly matching GloVe (ρ = 0.65) among classical unsupervised methods, though it falls short of more advanced supervised models like Sentence-BERT (ρ = 0.77). This demonstrates that meaningful semantic relations can emerge from deterministic geometry, not just statistical patterns.
The methodology behind HTP is based on analytic harmonic functions, treating each text token as a point in a phase space defined by its Unicode representation. First, each character in a token is mapped to its Unicode code point, and the sequence is converted into a deterministic integer identifier. This integer is then decomposed into residues using coprime moduli, with each residue projected onto a harmonic pair via trigonometric functions (sine and cosine). The final embedding vector is the concatenation of these pairs, yielding a smooth, periodic representation. For sentence-level embeddings, a pooling mechanism applies weights based on token frequency, similar to TF-IDF, to emphasize semantically informative words. Crucially, the process is fully reversible via the Chinese Remainder Theorem, allowing the original text to be recovered from the embedding with minimal error.
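The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the moduli, the base used to pack code points into one integer, and the phase convention are all assumptions chosen here for clarity.

```python
from math import sin, cos, tau, atan2

# Illustrative pairwise-coprime moduli (assumed values, not from the paper).
MODULI = (97, 101, 103, 107)

def token_to_int(token: str) -> int:
    """Pack a token's Unicode code points into one deterministic integer."""
    n = 0
    for ch in token:
        n = n * 0x110000 + ord(ch)  # base = size of the Unicode code space
    return n

def encode(token: str, moduli=MODULI) -> list[float]:
    """Integer -> residues mod coprime moduli -> (cos, sin) pairs, concatenated."""
    n = token_to_int(token)
    vec = []
    for m in moduli:
        theta = tau * (n % m) / m   # residue mapped to a phase on the unit circle
        vec.extend((cos(theta), sin(theta)))
    return vec

def decode(vec: list[float], moduli=MODULI) -> int:
    """Recover each residue from its phase, then the integer via the CRT."""
    residues = []
    for i, m in enumerate(moduli):
        theta = atan2(vec[2 * i + 1], vec[2 * i]) % tau
        residues.append(round(theta * m / tau) % m)
    M = 1
    for m in moduli:
        M *= m
    # Chinese Remainder Theorem reconstruction (unique modulo the product M).
    n = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        n += r * Mi * pow(Mi, -1, m)
    return n % M
```

With these small moduli the round trip is exact only for integers below the moduli's product, which covers single code points here; a real system would scale the moduli to the token range. Sentence pooling (the TF-IDF-like weighted average of token vectors) is omitted for brevity.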
Analysis of the benchmarks reveals HTP's efficiency and scalability. In terms of computational cost, HTP encodes sentence pairs in under 2 milliseconds on a single CPU core, with a memory footprint of less than 1 megabyte. This represents an efficiency improvement of roughly three orders of magnitude over transformer-based models like BERT, which requires 45 milliseconds on a GPU and 4.3 gigabytes of memory. Ablation studies showed that performance improves with embedding dimensionality, converging near ρ = 0.68 at 512 dimensions, while runtime remains low. The method also proved robust across languages, with correlations ranging from ρ = 0.553 for Chinese to ρ = 0.668 for English and Italian, indicating a language-agnostic character without any fine-tuning.
The implications of this research extend beyond mere efficiency. HTP offers a theoretical contribution by framing semantic structure as an emergent property of harmonic geometry, bridging symbolic computation and continuous vector semantics. This could inform the design of hybrid architectures that combine analytic determinism with contextual learning, potentially leading to more interpretable AI systems. Practically, HTP's reversibility and transparency make it suitable for applications demanding deterministic traceability, such as explainable AI, symbolic compression, or reversible database indexing. It provides a foundation for models that are not only fast and lightweight but also mathematically transparent, addressing growing concerns about the black-box nature of deep learning.
However, the paper acknowledges several limitations. HTP lacks contextual disambiguation, so polysemous words like "bank" receive identical representations regardless of meaning. Linear pooling may dilute compositional meaning in longer sequences, and small distortions in Unicode normalization can introduce discontinuities. Future work could explore multi-scale Fourier embeddings and adaptive frequency modulation to enhance semantic precision. Despite these limitations, HTP demonstrates that a substantial fraction of linguistic similarity can be reconstructed from symbolic geometry alone, suggesting that meaning, to a surprising extent, may indeed emerge from structure.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.