SimHash

SimHash is a locality-sensitive hashing technique that produces similar fingerprints for similar inputs. zetl uses it for fuzzy page name matching via the Similar Command.

The idea

Traditional hashes (like SHA-256 or BLAKE3) are designed so that even tiny input changes produce completely different outputs. SimHash does the opposite — similar inputs produce fingerprints that differ in only a few bits.

How it works

1. Break the input into overlapping pieces (character trigrams for page names).
2. Hash each piece independently.
3. Combine the hashes by voting: for each bit position, count +1 for every piece hash with a 1 in that position and -1 for every piece hash with a 0.
4. Set each bit of the final fingerprint to 1 if its vote total is positive, and to 0 otherwise.
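
The steps above can be sketched in a few lines of Python. This is an illustration only, not zetl's implementation: the function names, the choice of BLAKE2b truncated to 64 bits as the per-piece hash, and the lowercase/strip normalization are all assumptions made for the example.

```python
import hashlib

def trigrams(text):
    """Overlapping character trigrams; inputs shorter than 3 chars yield one piece."""
    if len(text) < 3:
        return [text] if text else []
    return [text[i:i + 3] for i in range(len(text) - 2)]

def hash64(piece):
    """Hash one piece to a 64-bit integer (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(piece.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def simhash(name):
    """64-bit SimHash of a page name (normalization here is an assumption)."""
    votes = [0] * 64
    for piece in trigrams(name.strip().lower()):
        h = hash64(piece)
        for bit in range(64):
            # +1 for a 1-bit, -1 for a 0-bit at this position
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:  # positive vote sets the fingerprint bit
            fingerprint |= 1 << bit
    return fingerprint
```

Because each trigram contributes one vote per bit, two names that share most of their trigrams tend to agree on most vote totals, and therefore on most fingerprint bits.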

Hamming distance

Similarity is measured by Hamming distance — the number of differing bits between two 64-bit fingerprints. Lower distance means more similar:

Distance	Interpretation
0	Identical (after normalization)
1–5	Very similar (minor typos)
6–12	Somewhat similar
13+	Likely different

Comparing two fingerprints takes a single XOR + popcount — O(1).
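
As a minimal sketch of that comparison, the snippet below XORs two fingerprints and counts the set bits, then maps the result onto the bands from the table above. The function names are hypothetical, and the band boundaries are copied from the table rather than from zetl's source.

```python
MASK64 = (1 << 64) - 1  # keep the comparison within 64 bits

def hamming_distance(a, b):
    """Number of differing bits between two 64-bit fingerprints."""
    return bin((a ^ b) & MASK64).count("1")

def interpret(distance):
    """Map a distance onto the bands listed in the table above."""
    if distance == 0:
        return "identical"
    if distance <= 5:
        return "very similar"
    if distance <= 12:
        return "somewhat similar"
    return "likely different"

# 0b1011 and 0b0010 differ in two bit positions
print(hamming_distance(0b1011, 0b0010), interpret(2))
```

On Python 3.10 and later, `(a ^ b).bit_count()` does the same popcount without building a string.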

Use cases in zetl

Typo correction: `zetl similar "zettelkasen"` finds "Zettelkasten"
Duplicate detection: pages whose names have a very low Hamming distance may be candidates for merging
Fuzzy lookup: when `--fuzzy` is passed to the Links Command, page names are matched via SimHash
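
A fuzzy name lookup of this kind can be sketched as a filter-and-rank over fingerprints. The code below is a hypothetical illustration, not zetl's code: `simhash` and `similar` are invented names, BLAKE2b and lowercase normalization are assumed, and the `threshold=12` and `limit=10` defaults are illustrative values.

```python
import hashlib

def simhash(name):
    """64-bit SimHash over character trigrams of the lowercased name (assumed scheme)."""
    text = name.strip().lower()
    pieces = [text[i:i + 3] for i in range(max(len(text) - 2, 1))] if text else []
    votes = [0] * 64
    for piece in pieces:
        digest = hashlib.blake2b(piece.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit, vote in enumerate(votes) if vote > 0)

def similar(query, page_names, threshold=12, limit=10):
    """Return up to `limit` (distance, name) pairs within `threshold`, closest first."""
    target = simhash(query)
    scored = []
    for name in page_names:
        distance = bin(target ^ simhash(name)).count("1")  # XOR + popcount
        if distance <= threshold:
            scored.append((distance, name))
    scored.sort()
    return scored[:limit]

pages = ["Zettelkasten", "SimHash", "Merkle Tree"]
print(similar("zettelkasen", pages))
```

Lowering `threshold` makes matching stricter. Duplicate detection could reuse the same function with a small threshold, comparing each page name against all the others.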

Limitations

In zetl, SimHash is applied to short strings (page names), not to full document content. For content search, use the Search Command. For content-level change detection, zetl uses Merkle Tree hashing instead.

For implementation details, see architecture/SimHash.

Flag	Default	Description
`--context N`	0	Include N characters of surrounding text
`--limit N`	50	Max results to return
`--regex`	off	Interpret query as a regular expression
`--case-sensitive`	off	Require exact case match
`--all`	off	Search raw content (include frontmatter, code blocks)
`--path <glob>`	none	Restrict results to files matching glob

Flag	Default	Description
`--threshold N`	12	Max Hamming distance (lower = stricter)
`--limit N`	10	Max results

Leaf type	Source
Heading	`## Section Title`
Paragraph	Prose text blocks
SplBlock	```spl fenced code blocks
Code	Non-SPL fenced code blocks
Table	Markdown tables
List	Ordered/unordered lists
Blockquote	`>` block quotes
Frontmatter	YAML between `---` fences

Flag	Default	Description
`--depth N`	1	Traverse N hops (1 = direct only)
`--fuzzy`	off	Enable fuzzy page name matching via concepts/SimHash
`--context N`	0	Include N characters of surrounding text
`--with-conclusions`	off	Show SPL conclusions each linked page contributes