Lost in Hyperspace

HackTheBox

numpy ml embeddings tsp pca dimensionality_reduction npz token_embeddings nearest_neighbor

greedy_tsp nearest_neighbor_heuristic pca_exploration pairwise_distance_analysis hamiltonian_path

Lost in Hyperspace — HackTheBox

Description

"A cube is the shadow of a tesseract casted on 3 dimensions. I wonder what other secrets may the shadows hold."

The challenge provided a zip file (password: hackthebox) containing a single file: token_embeddings.npz.

Files

token_embeddings.npz — NumPy compressed archive with two arrays: tokens and embeddings

Analysis

Data Structure

import numpy as np
data = np.load('token_embeddings.npz')

tokens = data['tokens']      # shape (110,), dtype <U1 — single characters
embeddings = data['embeddings']  # shape (110, 512), dtype float64

Key observations:

110 tokens, only 41 unique characters — letters, digits, symbols including {, }, _, #, !, -
Identical characters have different embeddings (pairwise distances of 8–30 between duplicates), so each embedding depends on the token's position in the message rather than on the character alone
Embedding values range from -1.64 to 1.64, norms range from 4.7 to 16.8
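The observations above can be reproduced with a few NumPy calls. The following is a minimal sketch, not the original analysis script: the helper name `summarize` is hypothetical, and the demo arrays are synthetic stand-ins for the real `tokens` and `embeddings`.

```python
import numpy as np

def summarize(tokens, embeddings):
    """Return basic statistics for a tokens/embeddings pair (hypothetical helper)."""
    norms = np.linalg.norm(embeddings, axis=1)
    # Collect distances between embeddings that belong to the same character.
    dup_dists = []
    for ch in np.unique(tokens):
        idx = np.flatnonzero(tokens == ch)
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                d = np.linalg.norm(embeddings[idx[a]] - embeddings[idx[b]])
                dup_dists.append(float(d))
    return {
        "n_tokens": len(tokens),
        "n_unique": len(np.unique(tokens)),
        "value_range": (float(embeddings.min()), float(embeddings.max())),
        "norm_range": (float(norms.min()), float(norms.max())),
        "dup_dist_range": (min(dup_dists), max(dup_dists)) if dup_dists else None,
    }

# Synthetic stand-in data (assumption): four tokens, one duplicated character.
rng = np.random.default_rng(0)
demo_tokens = np.array(list("abca"))
demo_emb = rng.normal(size=(4, 8))
stats = summarize(demo_tokens, demo_emb)
```

Calling `summarize(tokens, embeddings)` on the arrays loaded from the challenge file should give the corresponding counts and ranges for the real data.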

PCA Exploration

PCA revealed the embedding structure:

Two dominant principal components explaining ~83% of variance (44.6% + 38.1%)
Remaining components each explained <0.35%
Sorting tokens by PC1 or PC2 alone did not produce readable text
KMeans clustering with various k values (2–16) showed some structure but no clear flag
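The writeup does not show which PCA implementation was used, so the sketch below is one possible way to get the explained-variance ratios and the 2D projection with plain NumPy (an SVD of the centered data). The function name `pca` and the synthetic demo matrix are assumptions for illustration.

```python
import numpy as np

def pca(embeddings, n_components=2):
    """Project rows onto the top principal components via SVD (illustrative helper)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions; s holds singular values, largest first.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # explained-variance ratio per component
    projected = centered @ vt[:n_components].T
    return projected, explained

# Synthetic stand-in (assumption): a 2D plane embedded in 16D plus small noise.
rng = np.random.default_rng(1)
plane = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 16))
demo = plane + 0.01 * rng.normal(size=(50, 16))
projected, explained = pca(demo)
```

On the real data, `explained[:2]` would be compared against the two dominant components reported above, and `np.argsort(projected[:, 0])` gives the PC1 ordering that, per the writeup, did not yield readable text.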

Key Insight: Sequential Positional Encoding

The challenge hint about "shadows" pointed to the idea that the 512D embeddings encode sequential positional information. Consecutive characters in the original message should have nearby embeddings in the high-dimensional space.

This turns the problem into finding a short Hamiltonian path through 110 nodes (a path that visits every token exactly once), which is essentially the open-path variant of the Traveling Salesman Problem (TSP).

Solution

Strategy: Greedy Nearest-Neighbor TSP

Compute pairwise Euclidean distance matrix (110×110) using scipy.spatial.distance.cdist
For each starting token, greedily visit the nearest unvisited token
Try all 110 possible starting points and look for the flag in the reconstructed text
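The three steps above can be sketched as follows. This is an illustrative reconstruction, not the author's script. It computes the distance matrix with NumPy broadcasting instead of `scipy.spatial.distance.cdist` (the two should give the same Euclidean distances) so that the example only depends on NumPy. The function names, the `marker` default of `"HTB{"` (the usual HackTheBox flag prefix, assumed here), and the demo message are all hypothetical.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distance matrix; stands in for cdist(x, x)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def greedy_path(dist, start):
    """Visit every node once, always moving to the nearest unvisited node."""
    n = dist.shape[0]
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    path = [start]
    for _ in range(n - 1):
        # Mask visited nodes with infinity so argmin ignores them.
        row = np.where(visited, np.inf, dist[path[-1]])
        nxt = int(np.argmin(row))
        visited[nxt] = True
        path.append(nxt)
    return path

def candidate_strings(tokens, embeddings, marker="HTB{"):
    """Try every starting token; keep reconstructions containing the marker."""
    dist = pairwise_distances(embeddings)
    hits = []
    for start in range(len(tokens)):
        text = "".join(tokens[i] for i in greedy_path(dist, start))
        if marker in text:
            hits.append((start, text))
    return hits

# Synthetic stand-in (assumption): characters placed along a noisy line, then shuffled.
rng = np.random.default_rng(2)
message = "HTB{demo_flag}"
points = np.arange(len(message))[:, None] * np.ones(8) + 0.01 * rng.normal(size=(len(message), 8))
perm = rng.permutation(len(message))
demo_tokens, demo_emb = np.array(list(message))[perm], points[perm]
hits = candidate_strings(demo_tokens, demo_emb)
```

Greedy nearest-neighbor is only a heuristic, so trying every start matters: a path that begins mid-message has to jump back for the tokens it skipped, while a path that begins at the true first character can follow the sequence end to end.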

...