Why Embeddings Gather on an Approximate Hypersphere

Article language

If we treat modern embeddings as ordinary Euclidean vectors, it is hard to explain a persistent empirical fact: in many retrieval, representation-learning, and classification systems, vectors do not spread uniformly through the whole interior of the space. They cluster near a high-dimensional shell. In many of the same systems, the most effective similarity measure is not raw Euclidean distance but cosine similarity [1-5].

These two phenomena are not independent. The previous articles already showed that high-dimensional probability compresses norm fluctuations and transfers useful freedom toward directional structure. Training then further weakens radial noise and encourages semantically relevant differences to be encoded in angles. In practice, embedding spaces therefore often look closer to a hyperspherical manifold than to the full ambient $R^{d}$ .

Core view: the approximate hyperspherical distribution of embeddings is usually produced by three forces acting together: high-dimensional statistics already concentrate norms, normalization and contrastive-style objectives further suppress radial freedom, and inner-product readout keeps rewarding directionally readable organization. In many tasks, angle is therefore more stable than length and closer to the semantic structure the representation actually carries [1-5].

1. The first source of shell structure: norm concentration in high dimensions

For a high-dimensional random vector $x \in R^{d}$ whose coordinates are centered and have comparable scale,

∥ x ∥^{2} = \sum_{i = 1}^{d} x_{i}^{2}

typically has expectation growing linearly with $d$ , while its relative fluctuation shrinks like $1 / \sqrt{d}$ [1]. In Gaussian or more generally sub-Gaussian settings, this can be summarized as

∥ x ∥ = Θ (\sqrt{d}) with high probability .

This does not require any semantic structure. It only says that high dimension by itself does not encourage large relative radial differences. A shell is therefore the default stage provided by high-dimensional statistics, not a mysterious shape invented only after training.

This matters because it sets up everything that follows. Only once most vectors already live at comparable radii does angular information begin to dominate length information.

2. The second source of shell structure: training actively reduces radial freedom

Real embeddings are of course not independent random vectors. Training continues to shape their geometry, and in many systems it pushes precisely toward "stable length, informative direction."

Normalization layers explicitly control feature scale and suppress large norm drift.
Contrastive objectives reward local alignment together with global uniformity, which naturally favors angular organization on a hypersphere [2].
Many metric-learning methods optimize decision boundaries directly on the unit hypersphere. NormFace, SphereFace, and ArcFace are all examples where angular margin is central [3-5].

Approximate shell structure is therefore not the side effect of a single module. It is the point where high-dimensional statistics, network normalization, and loss design converge. High dimension first tells the model that radial differences are unstable; the training objective then tells it that if it wants to preserve the most reusable degrees of freedom, it should preserve them in direction. Figure 1 makes that radial compression and angular dominance explicit.

Illustration of shell geometry for embeddings

Figure 1. Once norms are compressed to comparable scales, the main differences between vectors show up as angular differences. This is why shell geometry and cosine similarity are so tightly connected.

Figure 1 does not say that all points have exactly the same length. It says that the main variation no longer comes from radius. That distinction determines when cosine will be a better first-order metric than raw Euclidean distance.

3. Why does cosine similarity become the natural metric?

For two vectors $x, y$ ,

\cos θ = \frac{x^{⊤} y}{∥ x ∥ ∥ y ∥},

and

∥ x - y ∥^{2} = ∥ x ∥^{2} + ∥ y ∥^{2} - 2 x^{⊤} y .

If $∥ x ∥$ and $∥ y ∥$ are both stable near the same radius $r$ , then

∥ x - y ∥^{2} \approx 2 r^{2} (1 - \cos θ) .

So once shell geometry holds, Euclidean distance and cosine similarity are both largely reading the same angular structure. The difference is that Euclidean distance still mixes in residual norm fluctuation, while cosine removes it explicitly. That is why cosine is often more stable and closer to the actual discriminative variable used by the trained representation [2-5].

Geometrically, cosine similarity amounts to projecting vectors to the unit sphere first and then comparing their relative positions there. For embeddings that already lie near a shell, this is not a lucky heuristic. It matches the internal organization of the space.

4. "Approximately on the sphere" does not mean norm is completely meaningless

There is an important boundary condition here. Saying that embeddings approximately live on a sphere does not mean norm never carries information. In real systems, vector length can still correlate with frequency, confidence, salience, or sample difficulty. Some tasks even use norm deliberately as an extra signal.

So the accurate statement is: in many modern representation-learning tasks, direction is more stable, more transferable, and closer to the shared semantic structure than norm is. Norm has not been banned; it is simply no longer the main discriminative variable.

This distinction matters. If we misread "hyperspherical embeddings" as "all vectors must have exact unit norm," we turn a useful first-order model into a false hard constraint.

5. When do cosine and Euclidean distance stop being approximately equivalent?

The earlier relation between cosine and Euclidean distance only holds once norm variation is small enough. Writing the radii explicitly makes the condition clear. Let

r_{x} = ∥ x ∥, r_{y} = ∥ y ∥,

then

∥ x - y ∥^{2} = (r_{x} - r_{y})^{2} + 2 r_{x} r_{y} (1 - \cos θ) .

This decomposition shows that Euclidean distance contains two signals: radial difference $(r_{x} - r_{y})^{2}$ and angular difference $2 r_{x} r_{y} (1 - \cos θ)$ . Only when $r_{x}$ and $r_{y}$ are nearly constant does the first term become negligible and Euclidean distance reduce to angular structure. If norm itself carries systematic signal, such as frequency, confidence, activity, or difficulty, then the two metrics rank pairs differently.

This is an important practical check. If the task only cares about semantic direction, cosine is often more robust. If norm also matters, dropping it may remove useful information. Likewise, if the representation is strongly anisotropic or dominated by bias directions, switching to cosine alone may not be enough; centering, whitening, or additional normalization may still be necessary.

6. A unified view: representation space is closer to a hyperspherical manifold than a full Euclidean volume

Putting the logic together gives a complete chain:

high-dimensional probability pushes samples onto a thin shell;
random directions on that shell become nearly orthogonal;
training objectives further amplify directional readability and reduce radial noise;
retrieval, contrastive learning, and classification keep rewarding this structure.

So although embeddings still formally live in $R^{d}$ , a better first-order approximation is often that they have been trained into a high-dimensional hyperspherical manifold with local semantic structure. The model does not use the entire Euclidean volume evenly. It compresses useful freedom into a more regular, lower-entropy geometric object.

That is why understanding embeddings cannot stop at inspecting individual coordinates. The right questions concern norm distributions, angular distributions, local neighborhoods, and how the whole point cloud spreads over the sphere.

7. Closing

The fact that embeddings gather near a hypersphere is not an accidental engineering artifact. It is a stable outcome of high-dimensional probability together with modern training objectives. Norm concentration sets the stage, near-orthogonality provides directional capacity, and training writes semantics into those directional relations.

More sharply: cosine similarity works so often in embedding systems not just because it is convenient, but because trained representations increasingly behave like spherical geometry. That is also the direct setup for later discussions of spherical codes, vocabulary capacity, and how LLM embeddings are organized.

References

[1] VERSHYNIN R. High-Dimensional Probability: An Introduction with Applications in Data Science[M]. Cambridge: Cambridge University Press, 2018. DOI: 10.1017/9781108231596.

[2] WANG T, ISOLA P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere[C]// Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020: 9929-9939. URL: https://proceedings.mlr.press/v119/wang20k.html.

[3] WANG F, XIANG X, CHENG J, et al. NormFace: L2 Hypersphere Embedding for Face Verification[C]// Proceedings of the 25th ACM International Conference on Multimedia. New York: ACM, 2017: 1041-1049. DOI: 10.1145/3123266.3123359.

[4] LIU W, WEN Y, YU Z, et al. SphereFace: Deep Hypersphere Embedding for Face Recognition[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6738-6746. DOI: 10.1109/CVPR.2017.713.

[5] DENG J, GUO J, XUE N, et al. ArcFace: Additive Angular Margin Loss for Deep Face Recognition[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 4690-4699. DOI: 10.1109/CVPR.2019.00482.

Why Embeddings Gather on an Approximate Hypersphere ​

1. The first source of shell structure: norm concentration in high dimensions ​

2. The second source of shell structure: training actively reduces radial freedom ​

3. Why does cosine similarity become the natural metric? ​

4. "Approximately on the sphere" does not mean norm is completely meaningless ​

5. When do cosine and Euclidean distance stop being approximately equivalent? ​

6. A unified view: representation space is closer to a hyperspherical manifold than a full Euclidean volume ​

7. Closing ​

References ​