NIP-FF-3 Edit-Durable OpenTimestamp Content Attribution

Specifies: kind 1041, kind 1042

draft optional

NIP-03 was marked unrecommended because of a timestamp-forgery attack. This NIP closes that vulnerability and adds useful new properties to the attestations that OpenTimestamps can provide on nostr.

This document has two top-level parts: an informal Explanation of what it does and why, and a normative Specification.

Explanation

Key Improvements

Vulnerability Fixes

NIP-FF-3 is not vulnerable to the timestamp-forgery attack that affects NIP-03.

Content Attestation VS Event Attestation

NIP-03 enabled a keypair to attest to an event’s existence before a block height.

NIP-FF-3 enables a keypair to attest to their content’s existence before a block height. This is useful for clients that wish to implement plagiarism detection in order to protect creators’ work and promote original content.

Edit Durability

NIP-03 was technically usable for replaceable/addressable events, but any edit would invalidate the prior OTS.

NIP-FF-3 allows the OTS timestamp to remain cryptographically valid for minor content edits.

Content Fingerprint Mechanism

This NIP uses a content similarity fingerprint: a 256-bit digest computed not from the raw bytes but from a MinHash sketch of the content’s hashed token shingles. Unlike a cryptographic hash function, this technique produces the same digest for two pieces of content that are similar but not necessarily identical. The construction is MinHash, not Charikar SimHash; the precise construction, and the reason that distinction matters, are given in The minhash-equality-v1 Algorithm. For brevity, this explanation layer calls it the “fingerprint.”

As two inputs become less similar, they are progressively less likely to share a fingerprint; once they fall below the algorithm’s tolerance they map to entirely different ones. Unlike SimHash there is no graduated distance between fingerprints: any two are either byte-identical or unrelated.

Properties of the algorithm can be adjusted in order to achieve durability under desired conditions. The variant proposed below is optimized for typical nostr-length content: short notes to longer longform articles.

Calculating the fingerprint of some content is nearly as cheap as calculating the SHA256 of the same content.

When publishing content, the 256-bit fingerprint is calculated for the content and written as hex in the event’s “X” tag. This enables nostr-native similarity search.

By querying relays (REQ) for the “X” tag, you can find all events that have the same “X” tag, meaning they have the same or similar content. The content can be re-fingerprinted to verify that the “X” tag correctly matches the content.

Plagiarism Detection

If you REQ for an “X” tag and get multiple results by different authors, there are two possible explanations:

  • the authors unintentionally or coincidentally published very similar or identical content (“GM”)
  • one author intentionally plagiarised another author

The first event published would have a high probability of being the original. However, the created_at of the events can’t be trusted, so this is not a reliable way to determine which event may be plagiarism.

The results of an “X” tag REQ don’t make any judgement, but they can assist in making a judgement. Social signals can be crucial in determining which author produced the original content.

Incentives

Including the fingerprint of your content on your nostr event is a signal that says, “I am willing to participate in a system where plagiarism is easy to detect.”

Likewise, with trivial plagiarism detection available, omitting the “X” tag on your event is a signal that says “I may not be willing to participate in a system where plagiarism is easy to detect.”

Because fingerprinting is trivially cheap, honest parties should not have any problem including the “X” tag on the content they publish.

Social Trust and Further Steps

The “X” tag combined with social webs of trust can, in many cases, provide enough information to decide which version of the content is legitimate or original. However, using OpenTimestamps we can provide an even stronger assurance by creating a reliable time-based anchor for the event’s fingerprint.

OpenTimestamp Mechanism

The steps below anchor the fingerprint to a block height using OTS. This overview is non-normative and only conveys the concept; the normative rules are in the Specification section.

  • The original event author creates their content and fingerprints it. They include the fingerprint in the “X” tag.
  • They sign a kind 1042 event which is a bare event containing the fingerprint. This binds the author’s private key via the signature to the fingerprint.
  • Then, the sha256 of the signature of the 1042 event is used as the OTS commitment.
  • Finally, the author publishes a kind 1041 event that contains:
    • the serialized 1042 event in the description tag
    • the OTS blob in the .content

The kind 1041 event serves as the public OTS attestation (similar to NIP-03’s kind 1040); it references the event containing the original content. The kind 1042 event is not meant to be published on its own; it is only meant to be carried inside the description tag of the kind 1041 event.

Plagiarism Detection with OTS

Now, when comparing events with the same “X” tag, the one with the earlier OTS proof has a high probability of being the authentic original content.

Edit Durability

As long as the content doesn’t change too much, its fingerprint “X” tag will remain the same. The OTS proof will be valid for as long as the content’s fingerprint remains identical. If the content’s fingerprint changes, the kind 1041 event’s fingerprint will no longer match, invalidating the proof.

Client Participation

Clients decide if this NIP is useful to them. Some clients may care about content authenticity, and some may not; still others may prefer plagiarised content. It isn’t possible to prevent plagiarism; even DRM software is consistently defeated. However, the tools proposed in this NIP enable people who care about authenticity to better quantify content authenticity and impose a greater social cost for dishonesty.

Motivation

Copyright law, as it exists today, is not built for independent artists: litigation is expensive, slow, and inaccessible to most creators. This NIP is not a legal instrument. It is a decentralized social legitimacy mechanism, a set of cryptographic primitives and public signals that let a community establish the most likely original author of a piece of content without courts, lawyers, or institutional intermediaries. As with Community Notes, the final determination always rests with the human reading the signal. No claim here is a verdict. The goal is to make plagiarism socially expensive and originality publicly verifiable.


Specification

The sections above explain what this NIP does and why. The rest of this document is the normative specification: exact event structures, the commitment and verification procedures, the security analysis, and the minhash-equality-v1 algorithm in enough detail to implement from scratch.

This NIP defines two event kinds: kind:1041, a publicly discoverable Bitcoin timestamp that attributes content to an author, and kind:1042, the author-signed authorship attestation that kind:1041 anchors. Both are new kinds rather than an augmentation of NIP-03’s kind:1040, for the reasons in the next section.

The Distinction from NIP-03

The explanation layer covered this conceptually; here it is precisely. NIP-03 (kind:1040) commits to an event ID, proving that a specific signed event existed before a block. This NIP commits to the author’s signature on a fingerprint attestation, proving that a specific author vouched for content of a given character before a block. That shift is what buys edit-durability, content-level rather than event-level attribution, and resistance to the forgery that got NIP-03 flagged unrecommended (detailed as Attack B under The OTS Commitment).

| Property | NIP-03 kind:1040 | NIP-FF-3 (kind:1041 + kind:1042) |
| --- | --- | --- |
| What is committed | Event ID | SHA256 of the author’s signature on a fingerprint attestation |
| What is claimed | This artifact existed | I authored content of this character |
| Identity bound in proof | No | Yes: the committed signature verifies only under the author’s key |
| Computable without the author’s private key | Yes, enables the pre-stamp attack | No, forming the commitment requires the author’s key |
| Survives edits | No, a new event ID breaks the proof | Yes, the attestation binds the fingerprint, not an event ID |
| Designed for | Any immutable event | Replaceable and addressable content events |
| created_at relevance | Informational only | Not used; Bitcoin block height is authoritative |

NIP-03’s normative requirement states: “The OpenTimestamps proof MUST prove the referenced e event id as its digest.” This NIP does not satisfy that requirement: it deliberately commits to a different value (a signature, not an event ID). This is why these are new kinds and not an augmentation of kind:1040.


Two Events, Two Roles

This NIP uses two events. Understanding the split is the key to the whole design:

  • kind:1042, the Authorship Attestation. A small event the author signs over the content’s fingerprint. Its signature is the cryptographic claim “I, this author, vouch for content with this fingerprint.” It is not broadcast to relays; it is carried, serialized, inside the kind:1041 event (the same pattern NIP-57 uses to carry a zap request inside a zap receipt). Producing it is the one and only step that requires the author’s private key.
  • kind:1041, the Fingerprint Timestamp. The public, discoverable event. It carries the OpenTimestamps proof, references the content event, and embeds the kind:1042 attestation. It may be signed and published by anyone, by the author or a timestamping service acting on their behalf. None of its own fields are trusted on their own; all trust derives from the embedded, author-signed attestation (see Client Verification).

The OTS proof commits to SHA256 of the kind:1042 signature. Because that signature can only be produced by the author, the timestamp cannot be forged for someone else, yet the act of anchoring and publishing remains fully delegable.

Kind:1041 Event Structure (Fingerprint Timestamp)

Tags

A kind:1041 event MUST include the following tags:

For addressable content events (kind 30000-39999, identified by kind:pubkey:d-tag coordinate):

["a", "<kind>:<author_pubkey>:<d-tag>", "<relay-url>"]
["k", "<content_kind>"]
["X", "<fingerprint_hex>", "minhash-equality-v1"]
["description", "<serialized kind:1042 attestation event, as a JSON string>"]

For non-addressable content events (kind 1 and others identified only by event ID):

["e", "<event_id>", "<relay-url>"]
["p", "<author_pubkey>"]
["k", "<content_kind>"]
["X", "<fingerprint_hex>", "minhash-equality-v1"]
["description", "<serialized kind:1042 attestation event, as a JSON string>"]

| Tag | Required | Description |
| --- | --- | --- |
| a | For addressable events | The stable coordinate of the content event. Persists across edits. |
| e | For non-addressable events | The event ID of the content event. |
| p | When using e tag | The content author’s pubkey. Allows verification without a relay lookup. |
| k | Always | The nostr kind integer of the content event, as a string. |
| X | Always | The 64-char lowercase hex fingerprint and algorithm identifier. Used for discovery; it MUST equal the X tag inside the embedded kind:1042 attestation. See The minhash-equality-v1 Algorithm. |
| description | Always | The full kind:1042 authorship attestation event, serialized as a JSON string (NIP-57 description convention). This is the author-signed object the OTS proof commits to. See The Authorship Attestation. |
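
As an illustration only, the two tag layouts above could be assembled as follows. This is a minimal Python sketch under stated assumptions: the function name build_1041_tags and its parameters are hypothetical and not part of this NIP, and the attestation is assumed to be an already-signed kind:1042 event held as a dictionary.

```python
import json

ALGORITHM = "minhash-equality-v1"


def build_1041_tags(content_kind, author_pubkey, fingerprint_hex, attestation,
                    relay_url, d_tag=None, event_id=None):
    """Return the tag list for a kind:1041 event (hypothetical helper).

    Pass d_tag for an addressable content event, or event_id for a
    non-addressable one. Exactly one of the two must be given.
    """
    if (d_tag is None) == (event_id is None):
        raise ValueError("provide exactly one of d_tag or event_id")

    if d_tag is not None:
        # Addressable: a single "a" coordinate carries the author's pubkey.
        reference = [["a", f"{content_kind}:{author_pubkey}:{d_tag}", relay_url]]
    else:
        # Non-addressable: "e" names the event, "p" names the author.
        reference = [["e", event_id, relay_url], ["p", author_pubkey]]

    return reference + [
        ["k", str(content_kind)],
        ["X", fingerprint_hex, ALGORITHM],
        # The attestation travels as a JSON string, not as a nested object.
        ["description", json.dumps(attestation, separators=(",", ":"),
                                   ensure_ascii=False)],
    ]
```

The same fingerprint_hex goes into both the kind:1041 X tag and the embedded attestation, which is what lets a verifier cross-check them later.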

Content Field

The .content field MUST contain the full bytes of an OpenTimestamps .ots proof file, base64-encoded. The proof MUST contain at least one Bitcoin block attestation (no pending calendar-only proofs). The OTS proof commits to SHA256(attestation.sig), where attestation is the kind:1042 event carried in the description tag, as described below.

The created_at Field

The created_at field of a nostr event is self-reported and can be set to any value by the publisher. It is not used by this protocol for determining authorship priority. This applies to both the kind:1041 and the embedded kind:1042 event. The Bitcoin block height attested by the OTS proof is the only authoritative ordering signal.

The Authorship Attestation (kind:1042)

The kind:1042 event is the author’s signed claim. Its shape is rigidly fixed so that there is no ambiguity about what is being attested, and so a verifier can reconstruct and check it deterministically:

{
  "kind": 1042,
  "pubkey": "<content author pubkey>",
  "created_at": <any; not used for ordering>,
  "content": "",
  "tags": [
    ["X", "<fingerprint_hex>", "minhash-equality-v1"],
    ["k", "<content_kind>"]
  ],
  "id":  "<computed per NIP-01>",
  "sig": "<author's BIP-340 signature, the value the OTS commits to via SHA256>"
}

| Field | Requirement |
| --- | --- |
| kind | MUST be 1042. |
| pubkey | MUST be the content author. This is the identity the attestation binds. |
| content | MUST be the empty string "". |
| X tag | MUST be present exactly once: the 64-hex fingerprint and the "minhash-equality-v1" identifier. This is the fingerprint the author is vouching for. |
| k tag | OPTIONAL: the content kind the attestation is scoped to. |
| created_at | Author/signer chosen; ignored for ordering. |

A kind:1042 event is produced through the ordinary nostr event-signing path (signEvent), so it works identically whether the author signs with a NIP-07 extension, a NIP-46 bunker, or a local key. Like a NIP-98 auth event, it is a signed token: it is embedded in a kind:1041 rather than broadcast on its own (though publishing it independently is harmless).
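
The fixed shape above can be sketched in Python as follows. This is a hedged illustration, not a reference implementation: build_unsigned_attestation and attestation_id are hypothetical names, the id is computed with the NIP-01 serialization as the author understands it (compact JSON array, UTF-8), and signing is deliberately left to whatever signer the author uses.

```python
import hashlib
import json

ALGORITHM = "minhash-equality-v1"


def attestation_id(pubkey, created_at, tags, content=""):
    """NIP-01 event id: SHA256 over [0, pubkey, created_at, kind, tags, content]."""
    serialized = json.dumps([0, pubkey, created_at, 1042, tags, content],
                            separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def build_unsigned_attestation(pubkey, fingerprint_hex, created_at, content_kind=None):
    """Build a kind:1042 event ready to hand to a signer (hypothetical helper)."""
    tags = [["X", fingerprint_hex, ALGORITHM]]      # exactly one X tag
    if content_kind is not None:
        tags.append(["k", str(content_kind)])       # optional scope
    event = {
        "kind": 1042,
        "pubkey": pubkey,
        "created_at": created_at,                   # not used for ordering
        "content": "",                              # MUST be empty
        "tags": tags,
    }
    event["id"] = attestation_id(pubkey, created_at, tags)
    return event                                    # "sig" is added by the signer
```

The returned event has no sig field; the signer (NIP-07 extension, NIP-46 bunker, or local key) supplies it.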


The OTS Commitment: What Gets Anchored

The OTS proof commits to the SHA256 of the author’s BIP-340 Schnorr signature on the kind:1042 attestation event. This single construction defeats two distinct attacks. Both must be understood, because each on its own would tempt a simpler, broken design.

Attack Vectors Thwarted

Attack A, OTS-blob theft. A kind:1041 event publishes its OTS proof in the clear (in .content). If the proof committed only to the fingerprint, a plagiarist could:

  1. Wait for an author to publish their kind:1041 event.
  2. Copy the public base64 OTS blob from the .content field.
  3. Publish their own kind:1041 for their plagiarized copy, reusing the stolen blob.

Because two near-identical texts share a fingerprint by design, the stolen blob would appear valid for the plagiarist too, at the same Bitcoin block height. The commitment must therefore be bound to the author’s identity, so a stolen blob cannot be re-attributed to someone else.

Attack B, pre-stamp and wait for key leak. This is the attack that got NIP-03 flagged unrecommended. If the committed value can be computed from public data alone, an attacker can:

  1. Pick a victim’s pubkey (public) and any content they wish to falsely attribute.
  2. Compute the commitment and OTS-stamp it today, obtaining a genuine, early Bitcoin block.
  3. Wait, possibly years, for the victim’s private key to leak.
  4. Once it leaks, sign and publish the content event and a matching kind:1041, now “proving” the victim authored the content before the (genuine, early) block.

A NIP-03 event-id commitment is computable from public data (the attacker chooses the content and created_at, and the pubkey is already public), which is exactly why the attack works. A bare-fingerprint commitment has the same flaw, and so does fingerprint XOR author_pubkey: both are computable by anyone who knows the public fingerprint and the public pubkey. Identity-XOR (the only protection in earlier drafts of this NIP) stops Attack A but does nothing against Attack B.

Why a Signature Defeats Both

The author’s signature on the kind:1042 attestation cannot be produced without the author’s private key, and it verifies only under the author’s public key. Committing to that signature gives both properties at once:

  • Against Attack A (blob theft): the committed signature is intrinsically identity-bound. A stolen blob commits to the SHA256 of Alice’s signature; it cannot be presented as Bob’s, because Bob’s pubkey will not verify Alice’s attestation, so the mandatory signature check (see verification) fails before OTS is even consulted.
  • Against Attack B (pre-stamp): the commitment can no longer be computed from public data. To form it, the attacker must already hold the private key at stamp time. If the key has not leaked, no commitment can be made; if it leaks later, any stamp the attacker then makes carries an honest, post-leak Bitcoin block height and cannot be back-dated.

The signature is what carries the security. The fingerprint inside the attestation provides edit-tolerant linkage, and the pubkey identifies whose key to check, but neither contributes attack resistance, because both are public. This is why the committed value is the signature, not the fingerprint or the pubkey.

No raw-message signing is required (and none is possible through NIP-07/NIP-46 signers). The author signs an ordinary nostr event; its structure (kind 1042 + the X tag) already separates it from any other signing domain.

Computing the Commitment

  1. Build the attestation. Construct the kind:1042 event described in The Authorship Attestation: the author’s pubkey, empty content, and the ["X", fingerprint_hex, "minhash-equality-v1"] tag (plus optional k).

  2. Sign it. The author signs the kind:1042 event through the normal signEvent path, yielding attestation.id and attestation.sig (a 128-hex / 64-byte BIP-340 signature). This is the only step that requires the author’s private key.

  3. Commit. The value anchored to Bitcoin is the SHA256 of that signature’s raw bytes:

    commitment_bytes = SHA256( hex_decode(attestation.sig) )   // 32 bytes
    commitment_hex   = hex_encode(commitment_bytes)            // 64 lowercase hex chars
    

    Submit commitment_hex to the OpenTimestamps service as the hash to be anchored. The resulting .ots binary proof commits to commitment_hex.

  4. Publish. Serialize the signed kind:1042 event to a JSON string, place it in the ["description", …] tag of a kind:1041 event, base64-encode the Bitcoin-attested .ots proof into the kind:1041 .content, and publish. Steps 3-4 require no private key and MAY be performed by a service on the author’s behalf.
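
Step 3 is small enough to show directly. The sketch below is illustrative Python; the function name ots_commitment is hypothetical, and submitting the digest to an OpenTimestamps calendar is out of scope here.

```python
import hashlib


def ots_commitment(sig_hex: str) -> bytes:
    """Return the 32-byte digest to be anchored: SHA256 of the raw signature bytes."""
    sig = bytes.fromhex(sig_hex)          # hex_decode(attestation.sig)
    if len(sig) != 64:
        raise ValueError("a BIP-340 signature is exactly 64 bytes (128 hex chars)")
    return hashlib.sha256(sig).digest()   # commitment_bytes


def ots_commitment_hex(sig_hex: str) -> str:
    """Lowercase hex form of the same digest (commitment_hex)."""
    return ots_commitment(sig_hex).hex()
```

Note that the hash is taken over the decoded 64 signature bytes, not over the 128-character hex string; hashing the hex text would produce a different commitment.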

Resolving the Author’s Pubkey

The pubkey that must verify the attestation is the content author’s pubkey. Because the kind:1041 may be published by anyone, this pubkey is taken from the content reference and then cross-checked against the embedded attestation:

  • From an a tag: parse the coordinate kind:pubkey:d-tag and take the middle segment.
  • From a p tag (used with e-tag events): take element [1] of the p tag directly.

The embedded kind:1042 attestation’s pubkey MUST equal this resolved value, and the attestation’s signature MUST verify (standard nostr event verification) before any trust is placed in the kind:1041. See Client Verification and Stamping Requires the Author’s Key below.
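
A possible way to perform this resolution is sketched below in Python. The function name resolve_author_pubkey is hypothetical. The coordinate is split at most twice because a d-tag value may itself contain colons.

```python
def resolve_author_pubkey(tags):
    """Return the content author's pubkey from a kind:1041 tag list, or None."""
    # Prefer the "a" coordinate: "<kind>:<pubkey>:<d-tag>".
    for tag in tags:
        if len(tag) >= 2 and tag[0] == "a":
            parts = tag[1].split(":", 2)   # keep any colons inside the d-tag
            if len(parts) == 3:
                return parts[1]
    # Otherwise fall back to the "p" tag that accompanies an "e" reference.
    for tag in tags:
        if len(tag) >= 2 and tag[0] == "p":
            return tag[1]
    return None
```

The value returned here is only a claim made by the kind:1041 publisher; it carries weight only once it matches the pubkey of a correctly signed embedded attestation.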


Client Verification Algorithm

Because a kind:1041 may be signed and published by anyone, none of its own fields are trusted on their own. All trust derives from the embedded kind:1042 attestation plus the cross-checks below. To verify a kind:1041 event:

  1. Confirm event.kind === 1041.
  2. Parse the description tag value as JSON to obtain the embedded attestation event A.
  3. Verify A as a standard nostr event: recompute A.id from its serialized fields (NIP-01) and verify A.sig against A.id under A.pubkey. Confirm A.kind === 1042. If this fails, reject: this is the identity and key-possession check that a thief or pre-stamper cannot satisfy.
  4. Cross-check identity: resolve the content author’s pubkey from the kind:1041’s a tag (middle segment of kind:pubkey:d-tag) or p tag, and confirm it equals A.pubkey.
  5. Cross-check the fingerprint: confirm A’s X tag equals the kind:1041’s X tag, that element [2] is "minhash-equality-v1", and read fingerprint_hex from it. The k tags on the two events SHOULD also agree; a mismatch is advisory metadata only and does not by itself invalidate the proof.
  6. Compute the commitment: commitment_hex = hex(SHA256(hex_decode(A.sig))).
  7. Base64-decode the kind:1041 .content to obtain the raw .ots binary.
  8. Use an OpenTimestamps client to verify the .ots binary against commitment_hex, and read the confirmed Bitcoin block height.

The kind:1041 is valid only if all of steps 3-5 and 8 pass. Among all valid kind:1041 events sharing an X tag value, the one with the earliest confirmed Bitcoin block height is the most likely original author.
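
The eight steps can be sketched as a single function. This is an illustrative Python outline, not a complete verifier: the Python standard library has neither BIP-340 nor OpenTimestamps support, so verify_schnorr and verify_ots are hypothetical callables the caller must supply, and the empty-content check is taken from The Authorship Attestation rather than from the numbered steps.

```python
import base64
import hashlib
import json

ALGORITHM = "minhash-equality-v1"


def nip01_id(ev):
    """Event id per NIP-01: SHA256 of the compact JSON array serialization."""
    payload = [0, ev["pubkey"], ev["created_at"], ev["kind"], ev["tags"], ev["content"]]
    data = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def first_tag(tags, name):
    return next((t for t in tags if t and t[0] == name), None)


def verify_1041(event, verify_schnorr, verify_ots):
    """Return the confirmed block height, or None if the event is rejected.

    verify_schnorr(pubkey_hex, msg_hex, sig_hex) -> bool        (assumed interface)
    verify_ots(ots_bytes, digest_bytes) -> int or None           (assumed interface)
    """
    try:
        if event["kind"] != 1041:                                   # step 1
            return None
        tags = event["tags"]
        att = json.loads(first_tag(tags, "description")[1])         # step 2
        if att["kind"] != 1042 or att["content"] != "":             # step 3
            return None
        if nip01_id(att) != att["id"]:
            return None
        if not verify_schnorr(att["pubkey"], att["id"], att["sig"]):
            return None
        a_tag, p_tag = first_tag(tags, "a"), first_tag(tags, "p")   # step 4
        author = a_tag[1].split(":", 2)[1] if a_tag else p_tag[1]
        if author != att["pubkey"]:
            return None
        att_x = [t for t in att["tags"] if t and t[0] == "X"]       # step 5
        x_tag = first_tag(tags, "X")
        if len(att_x) != 1 or att_x[0][:3] != x_tag[:3] or x_tag[2] != ALGORITHM:
            return None
        commitment = hashlib.sha256(bytes.fromhex(att["sig"])).digest()  # step 6
        ots_bytes = base64.b64decode(event["content"], validate=True)    # step 7
        return verify_ots(ots_bytes, commitment)                         # step 8
    except (KeyError, IndexError, TypeError, ValueError, AttributeError):
        return None  # malformed input is treated as a failed verification
```

Nothing in this function reads the kind:1041 event's own pubkey, id, or sig, which mirrors the rule that the outer event's signer is irrelevant to validity.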

What a Public Verifier Can and Cannot Confirm

The checks above prove: “the holder of A.pubkey’s key signed an attestation for fingerprint S before Bitcoin block N.” They do not, by themselves, prove that S is the true fingerprint of any particular content, because the fingerprint is computed over the pre-encryption plaintext (see Input: What Text to Hash). A public observer who holds only the encrypted event cannot recompute S; only the author or a buyer who possesses the plaintext can confirm that S is the fingerprint of that plaintext. This is inherent to attributing paywalled content: the timestamped claim is public and identity-bound, but tying it to specific bytes requires access to those bytes.


Stamping Requires the Author’s Key

Earlier drafts of this NIP allowed anyone to stamp on an author’s behalf: because the commitment fingerprint XOR author_pubkey was computable from public data, a service (or an attacker) could anchor it without the author’s involvement. That convenience was Attack B. A third party who knew a victim’s pubkey could pre-stamp arbitrary content attributed to that victim and wait for the key to leak. Requiring the author’s signature removes this capability by design; it is the fix, not a regression.

Under this construction, the kind:1041 cannot be formed without a valid kind:1042 attestation, which only the holder of the author’s private key can sign. The consequences:

  • A third party cannot manufacture an attestation for content the author never signed. This is precisely the property that closes the pre-stamp attack. There is no longer any way to anchor an attribution to an author who has not personally signed a kind:1042 event.
  • A service can still operate the anchoring workflow. Once the author signs the kind:1042 attestation, they can hand it to a service that computes the commitment, submits it to OpenTimestamps, waits for Bitcoin confirmation, and publishes the kind:1041. The author’s only required action is the single signature; everything downstream (OTS submission, calendar polling, publishing) is delegable. Services such as Fanfares can still batch and operate timestamping on creators’ behalf.
  • The kind:1041 MAY be signed by any pubkey. The identity that signs the kind:1041 event is irrelevant to validity; security rests entirely on the embedded kind:1042 attestation verifying under the content author’s pubkey. For the clearest public claim, the author SHOULD also sign the kind:1041 event itself.

In short: anchoring and publishing remain fully delegable, but the author’s participation (one signature on a kind:1042 attestation) is now mandatory and cannot be forged.


Relay Discovery

Because the X tag carries the raw fingerprint (not the commitment), plagiarism detection and timestamp discovery use the same relay query:

["REQ", "plagiarism-check", { "kinds": [1, 30023, 31337, 31338, 31339, 1041], "#X": ["<fingerprint_hex>"] }]

This single subscription returns all content events sharing a fingerprint AND all their kind:1041 timestamp proofs. Clients sort kind:1041 results by verified OTS block height to surface the most likely original author.
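
As a rough sketch of the client side, the Python below builds the subscription message shown above and orders already-verified results. The names plagiarism_req and rank_by_block_height are hypothetical, and the input to the ranking step is assumed to be (attestation pubkey, confirmed block height) pairs produced by a separate verification pass.

```python
import json

# Content kinds taken from the example query above, plus kind:1041 itself.
CONTENT_KINDS = [1, 30023, 31337, 31338, 31339]


def plagiarism_req(fingerprint_hex, sub_id="plagiarism-check"):
    """Serialize the REQ message that finds content and timestamps for a fingerprint."""
    filt = {"kinds": CONTENT_KINDS + [1041], "#X": [fingerprint_hex]}
    return json.dumps(["REQ", sub_id, filt])


def rank_by_block_height(verified):
    """Sort (pubkey, block_height) pairs so the earliest anchored claim comes first.

    Only pairs from kind:1041 events that passed full verification belong here;
    created_at is never consulted.
    """
    return sorted(verified, key=lambda pair: pair[1])
```

The first element of the ranked list is the most likely original author, not a verdict; ties or missing proofs still call for human judgement.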


Full Event Examples

The id, sig, and X hex values below are illustrative placeholders and will not actually verify or match a real descriptor; they stand in only to show structure.

The authorship attestation (kind:1042)

This is what the author signs. Its signature is the value the OTS commits to (via SHA256 of the signature’s raw bytes).

{
  "kind": 1042,
  "created_at": 1715779000,
  "pubkey": "3bf0c63fcb93463407af97a5e5ee64fa883d107ef9e558472c4eb9aaaefa459d",
  "content": "",
  "tags": [
    ["X", "a3f4e5d6c7b8a9f0e1d2c3b4a5968778b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7", "minhash-equality-v1"],
    ["k", "30023"]
  ],
  "id": "9b2c7d1e0f3a4b5c6d7e8f90a1b2c3d4e5f60718293a4b5c6d7e8f90a1b2c3d4",
  "sig": "a1b2c3d4e5f60718293a4b5c6d7e8f90112233445566778899aabbccddeeff00f0e1d2c3b4a5968778695a4b3c2d1e0f00ffeeddccbbaa998877665544332211"
}

Addressable content event (kind:30023 long-form article)

The kind:1042 event above is serialized to a JSON string and carried in the description tag (id/sig shown elided for readability):

{
  "kind": 1041,
  "created_at": 1715780000,
  "pubkey": "3bf0c63fcb93463407af97a5e5ee64fa883d107ef9e558472c4eb9aaaefa459d",
  "content": "AE9UUwABAQT/AQID...==",  // base64-encoded .ots file committing to SHA256(attestation.sig)
  "tags": [
    ["a", "30023:3bf0c63fcb93463407af97a5e5ee64fa883d107ef9e558472c4eb9aaaefa459d:550e8400-e29b-41d4-a716-446655440000", "wss://fanfares.nostr1.com"],
    ["k", "30023"],
    ["X", "a3f4e5d6c7b8a9f0e1d2c3b4a5968778b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7", "minhash-equality-v1"],
    ["description", "{\"kind\":1042,\"created_at\":1715779000,\"pubkey\":\"3bf0c63fcb93463407af97a5e5ee64fa883d107ef9e558472c4eb9aaaefa459d\",\"content\":\"\",\"tags\":[[\"X\",\"a3f4e5d6c7b8a9f0e1d2c3b4a5968778b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7\",\"minhash-equality-v1\"],[\"k\",\"30023\"]],\"id\":\"9b2c…c3d4\",\"sig\":\"a1b2…2211\"}"]
  ],
  "id": "...",
  "sig": "..."
}

Non-addressable content event (kind:1 note)

{
  "kind": 1041,
  "created_at": 1715780000,
  "pubkey": "3bf0c63fcb93463407af97a5e5ee64fa883d107ef9e558472c4eb9aaaefa459d",
  "content": "AE9UUwABAQT/AQID...==",  // base64-encoded .ots file committing to SHA256(attestation.sig)
  "tags": [
    ["e", "b4e5f6a7d8c9b0a1f2e3d4c5b6a79889c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5", "wss://fanfares.nostr1.com"],
    ["p", "3bf0c63fcb93463407af97a5e5ee64fa883d107ef9e558472c4eb9aaaefa459d"],
    ["k", "1"],
    ["X", "a3f4e5d6c7b8a9f0e1d2c3b4a5968778b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7", "minhash-equality-v1"],
    ["description", "{\"kind\":1042,\"created_at\":1715779000,\"pubkey\":\"3bf0…459d\",\"content\":\"\",\"tags\":[[\"X\",\"a3f4…d8e7\",\"minhash-equality-v1\"],[\"k\",\"1\"]],\"id\":\"…\",\"sig\":\"…\"}"]
  ],
  "id": "...",
  "sig": "..."
}

The minhash-equality-v1 Algorithm

This section specifies how to compute the value that goes in element [1] of the X tag. It is written for developers with no prior knowledge of similarity hashing or natural language processing.

What This Fingerprint Is (and Is Not)

A cryptographic hash function like SHA256 produces completely different output for even a single character change in the input. That is a desirable property for most uses, but it makes SHA256 useless for detecting plagiarism: if someone copies your article and changes one word, the SHA256 is entirely different.

This NIP uses a similarity fingerprint that does the opposite: similar inputs produce the same output. The construction defined here, minhash-equality-v1, is MinHash, not Charikar SimHash. Precisely, it is b-bit one-permutation MinHash (MinHash: Broder 1997; one-permutation: Li, Owen, Zhang 2012; b-bit minwise: Li, Konig 2010), tuned to produce identical output for texts that are semantically equivalent, even if they differ in minor ways like spacing, capitalization, URL changes, or light rephrasing. It approximates Jaccard (token-set) similarity, which suits “did this reuse my words” plagiarism detection. Despite the simhash-ts package name, this is not a sign-of-random-projections SimHash and is not compared by Hamming distance: two fingerprints are either byte-identical or unrelated. Two texts that share the same minhash-equality-v1 fingerprint are, by the definition of this algorithm, close enough in content to warrant human review for plagiarism.

The 8-bucket, low-order-hex parameters below were chosen from the empirical collision study in kb-private ADR-005, which measured and corrected a long-content false-positive defect in the earlier simhash-equality-v2. See Algorithm Version for the deprecation note.

This makes the fingerprint useful both as a plagiarism detection mechanism (compare X tags across events) and as the payload of the author’s signed attestation (kind:1042): a timestamp proof anchored to that signature says “this author had content of this character before Bitcoin block N,” and it remains valid as long as the content does not change enough to shift the fingerprint.

Input: What Text to Hash

The fingerprint is computed from the pre-encryption plaintext of the content event: the full secret text before it is encrypted and placed in the encrypted tag. For kind:1 and kind:30023, this is the full article or note body.

Do not compute the fingerprint from the .content field of the published event (which contains only the public preview), and do not compute it from the ciphertext.

Step 1: Canonicalize the Text

Canonicalization transforms the text into a normalized form so that cosmetic differences (accented letters, mixed-case, punctuation, numbers, URLs, extra spaces) do not produce different fingerprints for what is semantically the same content.

Apply these transformations in order:

  1. NFKD normalize the Unicode string. NFKD (compatibility decomposition) breaks compound characters into their base letter plus separate combining marks. For example, é (U+00E9) becomes e followed by a combining acute accent mark (U+0301).

  2. Strip combining marks: remove all characters in Unicode category M (marks: accents, diacritics, etc.) that were exposed by the NFKD step. After steps 1 and 2, é, ê, ë all become e. This makes the fingerprint insensitive to diacritics.

  3. Lowercase every character.

  4. Normalize line endings: replace every \r\n (Windows) and standalone \r (old Mac) with \n.

  5. Replace URLs: replace every substring matching the pattern http:// or https:// followed by any non-whitespace characters with the literal string " url " (a space, the word “url”, a space). URLs change frequently and should not affect the fingerprint.

  6. Remove zero-width characters: delete U+200B (zero-width space), U+200C (zero-width non-joiner), U+200D (zero-width joiner), and U+FEFF (byte-order mark / zero-width no-break space). These invisible characters are sometimes inserted to create artificially distinct copies of text.

  7. Remove punctuation and symbols: replace every sequence of one or more characters in Unicode categories P (punctuation) or S (symbols) with a single space. This removes hyphens, quotes, em-dashes, brackets, emoji, etc.

  8. Replace numbers: replace every sequence of one or more Unicode number characters (\p{N}) with the literal string " num " (a space, the word “num”, a space). This makes the fingerprint insensitive to numerical values that a plagiarist might change.

  9. Collapse whitespace: replace every sequence of one or more whitespace characters (spaces, tabs, newlines) with a single space character.

  10. Trim: remove any leading or trailing spaces.

After these steps, the text is a single normalized lowercase string containing only letters (without diacritics) and the special tokens url and num, separated by single spaces.
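
The ten transformations can be sketched in Python as shown below. This is a hedged illustration rather than the reference implementation: Python's re module has no \p{...} classes, so Unicode categories are checked with unicodedata, and edge cases (lowercasing rules, the Unicode version shipped with the runtime, the exact whitespace set) may differ from other languages. The helper names are hypothetical.

```python
import re
import unicodedata

# Step 6: characters to delete outright.
ZERO_WIDTH = {0x200B: None, 0x200C: None, 0x200D: None, 0xFEFF: None}


def _replace_runs(text, major_classes, replacement):
    """Replace each maximal run of characters whose Unicode major category
    (first letter of the general category) is in major_classes."""
    out, in_run = [], False
    for ch in text:
        if unicodedata.category(ch)[0] in major_classes:
            if not in_run:
                out.append(replacement)
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)


def canonicalize(text):
    text = unicodedata.normalize("NFKD", text)                            # 1
    text = "".join(c for c in text if unicodedata.category(c)[0] != "M")  # 2
    text = text.lower()                                                   # 3
    text = text.replace("\r\n", "\n").replace("\r", "\n")                 # 4
    text = re.sub(r"https?://\S+", " url ", text)                         # 5
    text = text.translate(ZERO_WIDTH)                                     # 6
    text = _replace_runs(text, "PS", " ")                                 # 7
    text = _replace_runs(text, "N", " num ")                              # 8
    text = re.sub(r"\s+", " ", text)                                      # 9
    return text.strip(" ")                                                # 10
```

For example, canonicalize("Café  Déjà-vu: visit https://example.com/x?y=1 in 2024!") returns "cafe deja vu visit url in num". The order matters: URLs are replaced before punctuation is stripped, otherwise the "://" would be removed and the URL pattern would no longer match.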

Step 2: Extract Tokens

Extract words from the canonicalized text using a Unicode-aware word match: find all maximal sequences of Unicode letters (\p{L}) or Unicode numbers (\p{N}). Because Step 1 (canonicalization) has already replaced every run of digits with num, in practice this step extracts sequences of letters.

This is not a simple space-split. The regex match /[\p{L}\p{N}]+/gu is applied to the canonicalized text.

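Example: after canonicalization, "The quick brown fox jumps" is "the quick brown fox jumps", and extraction yields ["the", "quick", "brown", "fox", "jumps"]. Steps 3 and 4 below then reduce this to ["quick", "brown", "jump"]: stemming turns "jumps" into "jump", "the" is a stop word, and "fox" is shorter than 4 characters.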
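
As a minimal sketch, the extraction is a single global regex match in TypeScript. The function name extractTokens is hypothetical; the pattern is the one quoted above.

```typescript
// Hypothetical helper illustrating Step 2; not the simhash-ts API.
function extractTokens(canonical: string): string[] {
  // String.prototype.match with the g flag returns null when nothing matches,
  // so fall back to an empty list.
  return canonical.match(/[\p{L}\p{N}]+/gu) ?? [];
}

const rawTokens = extractTokens("the quick brown fox jumps");
// rawTokens is ["the", "quick", "brown", "fox", "jumps"]; stemming and filtering happen later.
```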

Step 3: Stem Each Token

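Stemming reduces a word to an approximation of its root form, so that different grammatical forms of the same word are treated as identical. For example, "cats" and "cat" should be the same feature, as should "walked" and "walks". The approximation is crude: under the rules below "running" becomes "runn" while "runs" becomes "run", so not every pair of related forms is merged.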

This algorithm uses a simple custom stemmer, not a library like Porter or Snowball. Apply the following four rules to each token, checking them in order and applying the first rule that matches. If no rule matches, return the token unchanged.

Rule | Condition                          | Action                  | Example
1    | token length > 5 and ends with ing | remove the trailing ing | "running" (7) → "runn"
2    | token length > 4 and ends with ed  | remove the trailing ed  | "walked" (6) → "walk"
3    | token length > 4 and ends with es  | remove the trailing es  | "watches" (7) → "watch"
4    | token length > 3 and ends with s   | remove the trailing s   | "cats" (4) → "cat"

“Length” means the number of characters in the token before the suffix is removed. The length check prevents over-stemming very short words (e.g., "is" should not match rule 4 and become "i").

Apply stemming to all tokens before any filtering. The stemmed form is what gets filtered and hashed.
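
The four rules can be sketched as an ordered chain of checks. This is a minimal sketch with a hypothetical function name, stem. It assumes "length" can be measured with JavaScript's string length (UTF-16 code units), which equals the character count for text in the Basic Multilingual Plane; whether the reference implementation counts code points instead is not stated here.

```typescript
// Hypothetical helper illustrating Step 3; not the simhash-ts API.
function stem(token: string): string {
  const len = token.length; // assumption: UTF-16 code units, measured before removal
  if (len > 5 && token.endsWith("ing")) return token.slice(0, -3); // rule 1
  if (len > 4 && token.endsWith("ed")) return token.slice(0, -2);  // rule 2
  if (len > 4 && token.endsWith("es")) return token.slice(0, -2);  // rule 3
  if (len > 3 && token.endsWith("s")) return token.slice(0, -1);   // rule 4
  return token;                                                    // no rule matched
}

const stemmedExample = ["running", "walked", "watches", "cats", "is", "being"].map(stem);
// stemmedExample is ["runn", "walk", "watch", "cat", "is", "being"]
```

"being" is left unchanged because its length (5) does not exceed the rule 1 threshold and no later rule applies.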

Step 4: Filter Tokens

After stemming, discard tokens that would pollute the fingerprint with noise:

4a. Remove stop words. Stop words are extremely common English function words that appear in virtually all text and carry no information about what the content is about. Discard any stemmed token whose exact string appears in this set:

a, an, the, and, or, but, if, to, of, in, on, for, with, at, by, from,
up, down, out, over, under, into, about, between, after, before, through,
during, without, within, is, are, was, were, be, been, being, it, its,
that, this, these, those, as, not, can, could, should, would, will, may,
might, do, does, did, done, have, has, had, i, you, he, she, we, they,
them, our, your, their

4b. Remove short tokens. Discard any token whose character length is less than 4. Very short tokens are typically too common or ambiguous to be useful features.

Fallback: If applying both filters leaves an empty list, skip the filters and use all stemmed tokens with character length greater than 0. This prevents the algorithm from producing a meaningless result on very short or repetitive input. If the token list is still empty after the fallback, see the empty-content handling in Step 7.
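
The two filters and the fallback can be sketched as follows. The names STOP_WORDS and filterTokens are hypothetical; the word list is copied from 4a and the default minimum length from 4b.

```typescript
// Hypothetical helper illustrating Step 4; not the simhash-ts API.
const STOP_WORDS = new Set<string>(
  ("a an the and or but if to of in on for with at by from " +
    "up down out over under into about between after before through " +
    "during without within is are was were be been being it its " +
    "that this these those as not can could should would will may " +
    "might do does did done have has had i you he she we they " +
    "them our your their").split(" "),
);

function filterTokens(stemmed: string[], minTokenLength = 4): string[] {
  // 4a and 4b: drop stop words and tokens shorter than the minimum length.
  const kept = stemmed.filter((t) => !STOP_WORDS.has(t) && t.length >= minTokenLength);
  if (kept.length > 0) return kept;
  // Fallback: both filters emptied the list, so keep every non-empty stemmed token.
  return stemmed.filter((t) => t.length > 0);
}

const filteredExample = filterTokens(["the", "quick", "brown", "fox", "jump"]);
// filteredExample is ["quick", "brown", "jump"]
```

With an input such as ["the", "fox"], both filters remove everything, so the fallback returns ["the", "fox"] unchanged.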

Step 5: Build Shingles

With shingleSize = 1, each token is its own shingle, a unit of content to be hashed. A shingle is just a string representing one feature of the text.

If there are fewer tokens than shingleSize (which only matters when shingleSize > 1), join all tokens into a single space-separated shingle. With the default shingleSize = 1 this case does not arise.

Step 6: Hash Each Shingle and Assign to a Bucket

This step reduces the full list of shingles to a compact eight-number sketch using MinHash (specifically one-permutation MinHash: a single hash whose output range is split into buckets). The intuition: instead of recording all shingles, record only the lexicographically smallest hash seen in each of eight buckets. Two texts with similar shingle sets will tend to share the same minimum hash in each bucket, producing the same sketch, and therefore the same fingerprint.

For each shingle:

  1. Prepend the prefix eqs: to the shingle string. (This prefix namespaces the hash domain.)
  2. UTF-8 encode the prefixed string and compute SHA256. You now have 32 bytes.
  3. Hex-encode those 32 bytes as 64 lowercase hex characters.
  4. Assign to a bucket: take the first byte of the SHA256 output as an unsigned integer (0-255) and compute first_byte % 8. The result (0 through 7) is the bucket index.
  5. Update the bucket minimum: if this shingle’s 64-character hex string is lexicographically less than the current minimum for that bucket (or if no minimum has been recorded yet), replace the bucket minimum with this hex string. Lexicographic comparison means standard string comparison on the hex characters (‘0’ < ‘1’ < … < ‘9’ < ‘a’ < … < ‘f’).

After processing all shingles, each bucket holds the single lexicographically smallest SHA256 hex string seen among all shingles assigned to it.
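
A minimal sketch of the bucket update loop, using Node's built-in crypto module for SHA256. The function name bucketMinimums is hypothetical, and an empty bucket is represented here as null, which is an implementation choice of this sketch.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper illustrating Step 6; not the simhash-ts API.
function bucketMinimums(shingles: string[], bucketCount = 8): (string | null)[] {
  const minima = new Array<string | null>(bucketCount).fill(null);
  for (const shingle of shingles) {
    // Items 1-3: prefix, UTF-8 encode, SHA256, lowercase hex.
    const digest = createHash("sha256").update("eqs:" + shingle, "utf8").digest();
    const hex = digest.toString("hex");
    // Item 4: the bucket index is the first digest byte modulo the bucket count.
    const bucket = digest[0] % bucketCount;
    // Item 5: keep the lexicographically smallest hex string per bucket.
    const current = minima[bucket];
    if (current === null || hex < current) minima[bucket] = hex;
  }
  return minima;
}
```

Because each bucket keeps only a minimum, the result does not depend on the order of the shingles or on how often a shingle repeats.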

Step 7: Build the Descriptor String

For each bucket, take only the last 3 characters of its minimum hex string. Then assemble the following descriptor:

minhash-equality-v1|n=1|b=8|k=3|m=4|b0:<bucket_0_last_3>|b1:<bucket_1_last_3>|b2:<…>|b3:<…>|b4:<…>|b5:<…>|b6:<…>|b7:<bucket_7_last_3>

Where:

  • minhash-equality-v1: the algorithm version string, baked into the descriptor
  • n=1: shingle size
  • b=8: bucket count
  • k=3: hex characters kept per bucket
  • m=4: minimum token length from Step 4
  • b0:xxx through b7:xxx: the last 3 hex chars of each bucket’s minimum, written in bucket order 0 through 7

If a bucket received no shingles (nothing was assigned to it), write x instead of the 3-character minimum, e.g. b3:x.

Example:

minhash-equality-v1|n=1|b=8|k=3|m=4|b0:521|b1:3ed|b2:bc5|b3:x|b4:3ca|b5:x|b6:x|b7:6e1
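
Assembling the descriptor from the bucket minimums can be sketched as below. The function name buildDescriptor and the use of null for an empty bucket are hypothetical. The 64-character inputs in the usage lines are zero-padded placeholders chosen so that their last 3 characters match the example above; they are not real SHA256 outputs.

```typescript
// Hypothetical helper illustrating Step 7; not the simhash-ts API.
function buildDescriptor(
  minima: (string | null)[],
  keptHex = 3,
  shingleSize = 1,
  minTokenLength = 4,
): string {
  // Each bucket contributes its last keptHex characters, or "x" when empty.
  const slots = minima.map((hex, i) => `b${i}:${hex === null ? "x" : hex.slice(-keptHex)}`);
  return [
    "minhash-equality-v1",
    `n=${shingleSize}`,
    `b=${minima.length}`,
    `k=${keptHex}`,
    `m=${minTokenLength}`,
    ...slots,
  ].join("|");
}

// Placeholder minimums (not real hashes): zero-padded to 64 characters.
const pad = (tail: string): string => "0".repeat(64 - tail.length) + tail;
const descriptorExample = buildDescriptor([
  pad("521"), pad("3ed"), pad("bc5"), null, pad("3ca"), null, null, pad("6e1"),
]);
// descriptorExample ===
//   "minhash-equality-v1|n=1|b=8|k=3|m=4|b0:521|b1:3ed|b2:bc5|b3:x|b4:3ca|b5:x|b6:x|b7:6e1"
```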

Why the last 3 characters, and why 8 buckets? Both choices come from the collision study in ADR-005, and each addresses a different cause of unrelated texts colliding.

Last 3 (low-order) hex. The kept value is the minimum hash in a bucket, and a minimum is a small number whose high-order hex digits concentrate toward zero, carrying almost no information on long text. The low-order hex digits stay uniformly distributed no matter how small the minimum is. Keeping the low end is the b-bit minwise hashing rule and is what stops unrelated long documents from colliding. (The deprecated simhash-equality-v2 kept the first 3, which was the bug.)

8 buckets. The fingerprint matches only when two texts agree on the minimum in all buckets (a logical AND). With more buckets, two unrelated documents that merely share one common high-frequency word no longer collide, because they would have to coincide in all 8 buckets; near-duplicates, which share most tokens, still do. This is intentional heavy quantization: 3 hex chars is 12 bits per bucket, a coarse space, so minor edits (a few different words, light paraphrasing, reordering) usually map to the same bucket minimums and the same fingerprint. The residual tradeoff is that unrelated texts can still rarely share a fingerprint (an advisory false positive) and that very short content is less edit-durable. This is acceptable because the system is advisory: humans make the final determination about plagiarism.

Empty-content fallback: If the token list is empty even after the Step 4 fallback (the text produced no usable tokens at all), skip Steps 5-6 and use this descriptor instead:

minhash-equality-v1|empty|<canonicalized_text>

Where <canonicalized_text> is the full output of Step 1.

Step 8: Compute the Final Fingerprint

UTF-8 encode the descriptor string from Step 7 and compute SHA256. The result is 32 bytes. Hex-encode as 64 lowercase characters.

fingerprint = SHA256(UTF-8(descriptor))

This is the value stored in element [1] of the X tag.

Why SHA256 the descriptor at the end? It produces a fixed-size 32-byte output compatible with Bitcoin timestamping services (which expect standard SHA256 hashes), and it makes the fingerprint look and behave like any other hash so existing tooling requires no special cases.
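
The final step is a single hash call. A minimal sketch using Node's built-in crypto module; the function name fingerprintOf is hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper illustrating Step 8; not the simhash-ts API.
function fingerprintOf(descriptor: string): string {
  // UTF-8 encode the descriptor, SHA256 it, and hex-encode as 64 lowercase characters.
  return createHash("sha256").update(descriptor, "utf8").digest("hex");
}

// Sanity check against the standard SHA256 test vector for "abc".
const abcDigest = fingerprintOf("abc");
// abcDigest === "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
```

Because the version string is part of the descriptor, any change to it changes every fingerprint, even when the bucket values are identical.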

Summary: The Full Pipeline

plaintext content
  → NFKD normalize → strip diacritic marks → lowercase
  → normalize line endings
  → replace URLs with ' url '
  → remove zero-width chars
  → replace punctuation/symbols with spaces
  → replace numbers with ' num '
  → collapse whitespace → trim
  → extract /[\p{L}\p{N}]+/gu matches → raw token list
  → apply 4-rule stemmer to each token
  → filter: remove stop words and tokens shorter than 4 chars
    (fallback: if list is empty, skip filters, keep length > 0 tokens)
  → with shingleSize=1, each token is its own shingle
  → for each shingle: SHA256("eqs:" + shingle)
      → assign to bucket first_byte % 8
      → track lexicographically smallest hex per bucket
  → take last 3 chars of each bucket minimum
  → build descriptor: "minhash-equality-v1|n=1|b=8|k=3|m=4|b0:xxx|…|b7:xxx"
  → SHA256(UTF-8(descriptor)) → 64 hex chars = fingerprint (fingerprint_hex)

fingerprint → kind:1042 attestation event (X tag = fingerprint) → author signs via signEvent → attestation.sig
  → commitment_hex = SHA256(attestation.sig)
  → submitted to OpenTimestamps service
  → attestation serialized into the kind:1041 `description` tag
  → resulting .ots binary base64-encoded into kind:1041 .content
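
The fingerprint half of the pipeline can be sketched end to end in TypeScript. This is a minimal sketch under stated assumptions, not the simhash-ts source, and it has not been checked for bit-for-bit agreement with the reference implementation. All function names are hypothetical, token length is measured in UTF-16 code units, and the attestation and OpenTimestamps stages are not covered.

```typescript
import { createHash } from "node:crypto";

// Hypothetical end-to-end sketch of Steps 1-8; not the simhash-ts API.
const STOP_WORDS = new Set<string>(
  ("a an the and or but if to of in on for with at by from " +
    "up down out over under into about between after before through " +
    "during without within is are was were be been being it its " +
    "that this these those as not can could should would will may " +
    "might do does did done have has had i you he she we they " +
    "them our your their").split(" "),
);

const sha256Hex = (s: string): string =>
  createHash("sha256").update(s, "utf8").digest("hex");

function canonicalize(text: string): string {
  return text
    .normalize("NFKD")
    .replace(/\p{M}+/gu, "")
    .toLowerCase()
    .replace(/\r\n?/g, "\n")
    .replace(/https?:\/\/\S+/g, " url ")
    .replace(/[\u200B\u200C\u200D\uFEFF]/g, "")
    .replace(/[\p{P}\p{S}]+/gu, " ")
    .replace(/\p{N}+/gu, " num ")
    .replace(/\s+/g, " ")
    .trim();
}

function stem(t: string): string {
  if (t.length > 5 && t.endsWith("ing")) return t.slice(0, -3);
  if (t.length > 4 && t.endsWith("ed")) return t.slice(0, -2);
  if (t.length > 4 && t.endsWith("es")) return t.slice(0, -2);
  if (t.length > 3 && t.endsWith("s")) return t.slice(0, -1);
  return t;
}

function minhashDescriptor(text: string): string {
  const canonical = canonicalize(text);
  const stemmed = (canonical.match(/[\p{L}\p{N}]+/gu) ?? []).map(stem);
  let tokens = stemmed.filter((t) => !STOP_WORDS.has(t) && t.length >= 4);
  if (tokens.length === 0) tokens = stemmed.filter((t) => t.length > 0); // Step 4 fallback
  if (tokens.length === 0) return `minhash-equality-v1|empty|${canonical}`; // empty content

  // With shingleSize = 1 each token is its own shingle.
  const minima = new Array<string | null>(8).fill(null);
  for (const shingle of tokens) {
    const hex = sha256Hex("eqs:" + shingle);
    const bucket = parseInt(hex.slice(0, 2), 16) % 8; // first digest byte mod 8
    const current = minima[bucket];
    if (current === null || hex < current) minima[bucket] = hex;
  }
  const slots = minima.map((h, i) => `b${i}:${h === null ? "x" : h.slice(-3)}`);
  return ["minhash-equality-v1", "n=1", "b=8", "k=3", "m=4", ...slots].join("|");
}

function fingerprint(text: string): string {
  return sha256Hex(minhashDescriptor(text));
}
```

Inputs that canonicalize to the same string always share a fingerprint in this sketch: for example, fingerprint("paid 100 dollars") equals fingerprint("paid 2500 dollars") because both numbers become num.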

Algorithm Version

This specification defines the minhash-equality-v1 algorithm, named for the version string baked into every descriptor. The simhash-ts npm package (authored by the Fanfares team) uses a versioning policy where algorithm variants are never modified, only added. If the algorithm changes, the version string changes, and a new NIP-FF revision will specify it.

Deprecation of simhash-equality-v2. An earlier draft of this NIP specified simhash-equality-v2 (2 buckets, first-3-hex selection). The collision study in kb-private ADR-005 found it produced an unacceptable rate of long-content false positives (unrelated articles sharing a fingerprint). minhash-equality-v1 supersedes it. simhash-equality-v2 remains frozen in simhash-ts for backward compatibility but MUST NOT be used for new content. The two versions never collide in discovery: the version string is part of the hashed descriptor, so the same content produces a different X value (element [1], the hex a #X filter actually matches) under v1 than under v2, and a query for a v1 fingerprint never returns v2 events.

Implementors MUST produce bit-for-bit identical output to minhashEquality(text, MINHASH_EQUALITY_DEFAULTS) from simhash-ts. The algorithm above is the normative specification; the library is the reference implementation. The parameter values for MINHASH_EQUALITY_DEFAULTS are:

bitLength:             256
shingleSize:           1
bucketCount:           8
keptHexCharsPerBucket: 3
minTokenLength:        4
