miscfreemedium

Lost in Hyperspace

HackTheBox

The challenge provided a zip file (password: `hackthebox`) containing a single file: `token_embeddings.npz`.

$ ls tags/ techniques/
greedy_tspnearest_neighbor_heuristicpca_explorationpairwise_distance_analysishamiltonian_path

Lost in Hyperspace — HackTheBox

Description

"A cube is the shadow of a tesseract casted on 3 dimensions. I wonder what other secrets may the shadows hold."

The challenge provided a zip file (password: hackthebox) containing a single file: token_embeddings.npz.

Files

  • token_embeddings.npz — NumPy compressed archive with two arrays: tokens and embeddings

Analysis

Data Structure

import numpy as np data = np.load('token_embeddings.npz') tokens = data['tokens'] # shape (110,), dtype <U1 — single characters embeddings = data['embeddings'] # shape (110, 512), dtype float64

Key observations:

  • 110 tokens, only 41 unique characters — letters, digits, symbols including {, }, _, #, !, -
  • Same characters have different embeddings (pairwise distances of 8–30 between duplicates), meaning each token has a unique position-dependent embedding
  • Embedding values range from -1.64 to 1.64, norms range from 4.7 to 16.8

PCA Exploration

PCA revealed the embedding structure:

  • Two dominant principal components explaining ~83% of variance (44.6% + 38.1%)
  • Remaining components each explained <0.35%
  • Sorting tokens by PC1 or PC2 alone did not produce readable text
  • KMeans clustering with various k values (2–16) showed some structure but no clear flag

Key Insight: Sequential Positional Encoding

The challenge hint about "shadows" pointed to the idea that the 512D embeddings encode sequential positional information. Consecutive characters in the original message should have nearby embeddings in the high-dimensional space.

This transforms the problem into finding a Hamiltonian path through 110 nodes — essentially a Traveling Salesman Problem (TSP).

Solution

Strategy: Greedy Nearest-Neighbor TSP

  1. Compute pairwise Euclidean distance matrix (110×110) using scipy.spatial.distance.cdist
  2. For each starting token, greedily visit the nearest unvisited token
  3. Try all 110 possible starting points and look for the flag in the reconstructed text

Solve Script

#!/usr/bin/env python3 """ Lost in Hyperspace — Greedy TSP on token embeddings. Consecutive characters in the hidden message have nearby 512D embeddings. """ import numpy as np from scipy.spatial.distance import cdist # Load data data = np.load('token_embeddings.npz') tokens = data['tokens'] # (110,) single characters embeddings = data['embeddings'] # (110, 512) # Compute pairwise Euclidean distance matrix dist_matrix = cdist(embeddings, embeddings, metric='euclidean') def greedy_tsp(start_idx, dist_matrix): """Nearest-neighbor greedy TSP from a given starting node.""" n = dist_matrix.shape[0] visited = [start_idx] unvisited = set(range(n)) - {start_idx} while unvisited: current = visited[-1] # Find nearest unvisited node nearest = min(unvisited, key=lambda j: dist_matrix[current][j]) visited.append(nearest) unvisited.remove(nearest) return visited # Try all 110 starting points for start in range(len(tokens)): path = greedy_tsp(start, dist_matrix) text = ''.join(tokens[i] for i in path) if 'HTB{' in text: # Extract flag flag_start = text.index('HTB{') flag_end = text.index('}', flag_start) + 1 flag = text[flag_start:flag_end] print(f"Start index {start} ('{tokens[start]}'):") print(f" Full text: {text}") print(f" Flag: {flag}") # Verify: show distances between consecutive flag characters flag_indices = path[flag_start:flag_end] print(" Consecutive distances in flag:") for i in range(len(flag_indices) - 1): a, b = flag_indices[i], flag_indices[i+1] d = dist_matrix[a][b] print(f" {tokens[a]}{tokens[b]}: {d:.2f}") break

Flag Recovery

Starting from token index 80 (H), the greedy TSP produced:

HLPRI8WBRTYH{{DFPVAZX845CMNBVE}77}SFDCE123____HTB{L0ST_1N_TH3_SP1R4L}!}954-!{FOIURTEQW645#Z4XNLEPOKGEFISDUVBMA

The flag HTB{L0ST_1N_TH3_SP1R4L} appeared at position 46 in the reconstructed sequence.

Verification

Consecutive characters in the flag had consistently small distances (~6–7 units), confirming the path was correct:

TransitionDistance
H→T6.60
T→B6.69
B→{6.37
{→L6.91
L→06.81
0→S6.54
S→T6.72
T→_6.45
...~6–7

Multiple starting points (indices 16, 40, 85, 96) all produced paths containing the same flag substring, confirming robustness of the approach.

Lessons Learned

  1. Positional embeddings encode sequence order — when same characters have different vectors, the embedding captures position, not just identity. Nearby positions → nearby vectors.
  2. Greedy nearest-neighbor TSP is surprisingly effective — for well-separated sequential embeddings, the greedy heuristic finds the correct path without needing optimal TSP solvers.
  3. Try all starting points — the greedy path is sensitive to the starting node. With only 110 tokens, trying all starts is trivial (O(n²) per start, O(n³) total).
  4. PCA is useful for exploration but not the solution — it confirmed structure existed (83% variance in 2 components) but the actual decoding required the TSP approach in the full 512D space.
  5. Distance consistency validates the solution — consecutive flag characters having uniform ~6–7 unit distances confirms the path is following the intended sequence, not jumping randomly.

$ cat /etc/motd

Liked this one?

Pro unlocks every writeup, every flag, and API access. $9/mo.

$ cat pricing.md

$ grep --similar

Similar writeups