GAN & Diffusion in Meme Coins

Attackers today can scrape a few seconds of an influencer’s video (e.g. from YouTube or TikTok) and use those frames to fine-tune a generative model. By training a LoRA (Low-Rank Adaptation) adapter on the target’s face and style, the attacker injects small trainable matrices into a pre-trained diffusion network, effectively “locking in” the influencer’s visage without retraining the whole model. In practice, one might use a service like OneShotLoRA to upload a short clip of the influencer and obtain a specialized image model. With that model, the attacker runs a standard diffusion pipeline: start from random noise in the latent space, iteratively denoise it conditioned on prompts like “InfluencerName announces exclusive NFT drop,” and finally decode the result to RGB frames. The frames can be sequenced into a short promo video; optionally, a video-specific diffusion network or temporal smoothing step enforces frame-to-frame consistency. The result is a convincing pseudo-endorsed promo, complete with the influencer’s face and, if needed, a cloned voice.

  • Frame Extraction: Sample keyframes from the influencer’s existing videos, isolating the face and hands against various backgrounds. Preprocess by aligning and cropping to focus on the subject.
  • LoRA Fine-Tuning: Inject trainable rank-limited weight matrices into a diffusion model’s attention layers. Only the LoRA weights update during training, adapting the model’s output to mimic the influencer’s features. Because LoRA is lightweight, one can train on a few dozen images without expensive compute.
  • Diffusion Generation: Run the denoising diffusion process with text or image prompts. For example, prompt the model to generate frames of the influencer speaking about a new NFT or token whitelist. The model yields high-resolution frames reflecting the influencer’s likeness and tone (a minimal generation sketch follows this list).
  • Video Assembly: Stitch the generated frames into a video, applying interpolation or a frame-level diffusion refinement to smooth motion. Some pipelines even use a secondary “video diffusion” step (e.g. training a content-motion latent diffusion model) to enforce temporal coherence.
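To make the generation loop concrete, the snippet below is a minimal sketch of steps two and three using the Hugging Face diffusers library: it loads a fine-tuned LoRA adapter into an off-the-shelf Stable Diffusion pipeline and samples a handful of frames from a text prompt. The model ID, adapter path, and prompt are illustrative placeholders, and a real pipeline would add the interpolation or video-diffusion stage described above.

```python
# Minimal sketch: LoRA-conditioned frame generation with Hugging Face diffusers.
# Model ID, LoRA path, and prompt are illustrative placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # base diffusion model (placeholder)
    torch_dtype=torch.float16,
).to("cuda")

# Attach the low-rank adapter fine-tuned on the target frames.
pipe.load_lora_weights("./lora_adapter")  # hypothetical local adapter path

prompt = "a person presenting a product announcement, studio lighting"
frames = []
for seed in range(8):  # a few frames; real pipelines generate hundreds
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, num_inference_steps=30, generator=generator).images[0]
    frames.append(image)

# frames[] would then be interpolated / temporally smoothed and encoded into a video.
```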

Such faked content can then be injected into social-token ecosystems. For example, an attacker might post a deepfake “AMA” video or newsletter image, claiming that a verified influencer is backing a new ERC‑20 token or NFT whitelist. Since social tokens often rely on influencer marketing, these fake endorsements can create a sudden surge of hype. In fact, reports note that AI-driven scams are using deepfakes to flood social media with fake endorsements and hype around specific crypto tokens. By amplifying “pump” narratives with viral video, the scammer can drive a pump-and-dump of a community currency or rug pull an NFT pre-sale before the ruse is detected.

Voice-Clone Scams

In the audio realm, attackers exploit advanced voice-cloning pipelines to impersonate trusted figures. Modern systems convert target speech into spectrogram-like feature representations, extract parameters such as timbre, pitch contour, and prosody, and then synthesize a waveform with a neural vocoder conditioned on those features. For example, expressive neural voice cloning models encode pitch and emotion via latent style tokens and explicitly model the speaker’s pitch contour. With just a few seconds of a person’s voice (captured from public videos, old calls, or compromised recordings), an attacker can train a synthetic voice that mimics that person’s spectral fingerprint remarkably well.

This technique was at the core of a Hong Kong case in early 2025: scammers compromised a finance manager’s account and used an AI-generated deepfake of his voice to deliver WhatsApp instructions. The victim received voice-memo “instructions” supposedly from the manager, guiding them to send cryptocurrency (USDT) payments. Because the tone, intonation, and even pitch contour matched the real manager’s style, the victim obeyed and wired ~HK$145M in a few transactions. Only after the money was gone did they realize the voice had been fully synthesized.

From a technical standpoint, the attacker’s audio workflow involved:

  • Feature Extraction: Convert the target’s speech sample into a log-mel spectrogram (and possibly additional prosodic features). These spectral features capture the voice’s resonant peaks and inflections (a minimal extraction sketch follows this list).
  • Model Training: Fine-tune a Text-to-Speech or voice-conversion model using the extracted features (or use a few-shot speaker encoder like in “Nautilus” or adaptive TTS). The network learns to reproduce the voice’s pitch contour and timbre.
  • Synthesis: Input scripted transaction instructions (as text or phonemes) to generate an audio file. A neural vocoder (e.g. a WaveNet variant) converts the predicted spectrogram into a waveform that sounds like the target speaker.
  • Delivery: Send the synthetic voice clip over an instant-message channel or phone call, bypassing traditional authentication.
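For the feature-extraction step, the representation involved can be sketched with standard audio tooling. The example below computes a log-mel spectrogram with librosa (an assumption; the file path and frame settings are placeholders), and is deliberately simplified: real cloning pipelines layer prosodic features and speaker embeddings on top of this.

```python
# Sketch: compute the log-mel spectrogram representation that TTS / voice-conversion
# models (and forensic tools) typically operate on. Paths are placeholders.
import librosa
import numpy as np

# Load a short speech sample at 22.05 kHz (a common TTS sampling rate).
y, sr = librosa.load("speech_sample.wav", sr=22050)

# 80-band mel spectrogram with typical TTS frame settings.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (80, n_frames): the "spectral fingerprint" the model learns to mimic
```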

Crucially, this attack hinges on “trust on first use” (TOFU). The victim had no second-channel check: no quick callback, text code, or similar out-of-band verification, so the fake voice was accepted at face value. Requiring that extra channel is one of the simplest and most effective ways to stop voice-clone scams.

For clarity: a “pre-shared voice print” is an enrolled biometric template that a verification system compares incoming speech against, and “out-of-band verification” means confirming a request over a separate, independent channel.

The cloned voice simply slipped through authentication: as one analysis notes, today’s AI clones can match stored biometric templates “closely” enough to pass both passive and active voice checks. In other words, legacy voice-verification systems assume natural human speech variation and simply did not anticipate an adversary who could perfectly mimic a person’s spectral signature. Once the victim believed the deepfake voice was genuine, no additional challenge was issued.

To defeat such scams, defenders turn to spectrogram analysis and pitch-pattern anomaly detection. Forensics might compare the pitch contour of a suspect message against historical recordings, or run anti-spoofing models trained to spot the residual artifacts of neural synthesis. But as [6] reports, modern TTS clones often produce speech that is, ironically, “cleaner” and more stable than a real human’s, making purely acoustic detection difficult. In the DAO context, any system that relies on voice calls for authentication is similarly vulnerable unless it implements a challenge-response or liveness test.
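As a hedged illustration of the pitch-contour comparison described above, the sketch below extracts fundamental-frequency (F0) contours from a suspect clip and a known-genuine recording using librosa’s pYIN tracker and compares simple statistics. The file names and threshold are placeholders; production anti-spoofing systems rely on learned embeddings and artifact detectors rather than hand-set rules.

```python
# Sketch: compare the pitch contour of a suspect voice message against a known-genuine
# recording of the same speaker. File paths and thresholds are placeholders.
import librosa
import numpy as np

def f0_stats(path: str) -> tuple[float, float]:
    """Return mean and standard deviation of the voiced fundamental frequency (Hz)."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    voiced = f0[voiced_flag & ~np.isnan(f0)]
    return float(np.mean(voiced)), float(np.std(voiced))

ref_mean, ref_std = f0_stats("known_genuine.wav")
sus_mean, sus_std = f0_stats("suspect_message.wav")

# Flag large deviations in average pitch or pitch variability. Synthetic speech is
# often *more* stable than natural speech, so unusually low variability is itself a signal.
if abs(sus_mean - ref_mean) > 20 or sus_std < 0.5 * ref_std:
    print("pitch profile inconsistent with historical recordings -- escalate for review")
else:
    print("no obvious pitch anomaly (which does not prove authenticity)")
```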

Detection & Mitigation Tooling

Combating AI impersonation in Web3 requires both provenance and anomaly detection. One key strategy is Content Authenticity: embed tamper-evident metadata at creation time. Initiatives like Adobe’s Content Authenticity Initiative (CAI) and the C2PA standard define a content credential, a signed record of an asset’s origin, edits, and AI use. This credential includes cryptographic hashes of the content, descriptions of the creation tools, and the edit history. In practice, a generator (e.g. an image editor) attaches this manifest to a JPEG or video file, and the manifest’s integrity is protected by Merkle-tree hashing and digital signatures.
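To make the manifest idea concrete, here is a deliberately simplified sketch: it hashes the asset, records a minimal provenance claim, and signs the result with an Ed25519 key (assuming the Python cryptography library). This is not the real C2PA serialization, which uses JUMBF/CBOR structures and certificate chains; it only illustrates the hash-plus-signature pattern.

```python
# Simplified illustration of a content-credential-style manifest: hash the asset,
# describe its origin, and sign the claim. NOT the actual C2PA format.
import hashlib, json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

signing_key = Ed25519PrivateKey.generate()  # in practice, a certified creator key

manifest = {
    "asset_hash": sha256_file("promo_frame.jpg"),  # placeholder asset
    "created_with": "example-generator/1.0",       # creation tool description
    "ai_generated": True,                          # disclosure of AI use
    "edit_history": [],                            # prior manifests / edits would chain here
}
payload = json.dumps(manifest, sort_keys=True).encode()
signature = signing_key.sign(payload)

# The signed payload (plus the signer's public key / certificate) travels with the file,
# e.g. embedded in XMP metadata, so any later edit invalidates the hash or the signature.
```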

For Web3, these hashes can be posted on-chain to cement lineage. For example, a gallery NFT could store the hash of the original image’s content credential in its mint transaction. Any later claim that the image is authentic can be verified by re-computing the hash from the current file and comparing it to the on-chain record. Blockchains thus provide a public bulletin board: once the provenance certificate’s root hash is recorded, any tampering with the file (or its claimed edit history) breaks the hash chain. Open standards let different tools interoperate: a CAI/C2PA manifest can be embedded in standard metadata fields (XMP/IPTC) or indexed off-chain with on-chain pointers.
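The verification side is then a straightforward hash comparison. The sketch below recomputes the file hash and checks it against whatever the mint transaction recorded; fetch_onchain_hash is a hypothetical helper standing in for a contract read (e.g. via web3.py), since the actual storage layout depends on the NFT contract.

```python
# Sketch: verify a file against the provenance hash anchored on-chain at mint time.
# fetch_onchain_hash() is a hypothetical placeholder for a contract read.
import hashlib

def sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch_onchain_hash(token_id: int) -> str:
    """Placeholder: read the recorded credential hash for token_id from the NFT contract."""
    raise NotImplementedError("wire this to your contract, e.g. with web3.py")

def verify(path: str, token_id: int) -> bool:
    # Any edit to the file (or its claimed credential) changes the hash and fails this check.
    return sha256_file(path) == fetch_onchain_hash(token_id)
```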

Provenance metadata can’t be stamped onto a stream that’s unfolding in real time, so the defense shifts to out-of-band verification and liveness challenges. The moment a high-risk request arrives over a call or video, pause and confirm it through a second, independent channel you already trust (e-mail, secure chat, or a callback to the number on file). Pair that with quick liveness tests: ask the caller to repeat a random pass-phrase, move the camera to show a specific object, or read back a one-time code. These are steps a pre-rendered deepfake can’t satisfy on the fly, and such two-channel and challenge-response checks are the most practical way to stop live voice-clone or video-clone scams.
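A minimal version of the challenge-response idea can be scripted in a few lines: generate a random, short-lived pass-phrase, deliver it over a trusted second channel, and require the live caller to repeat it before any transfer is approved. The word list and expiry window below are arbitrary choices for illustration.

```python
# Sketch: issue a short-lived random pass-phrase over a trusted second channel and
# verify that the live caller repeats it before approving a high-risk request.
import secrets
import time

WORDS = ["amber", "falcon", "granite", "harbor", "meadow", "nickel", "orchid", "tundra"]
CHALLENGE_TTL_SECONDS = 120  # arbitrary expiry window

def issue_challenge() -> tuple[str, float]:
    phrase = "-".join(secrets.choice(WORDS) for _ in range(3))  # e.g. "falcon-tundra-amber"
    return phrase, time.time() + CHALLENGE_TTL_SECONDS          # send phrase via the trusted channel

def verify_response(expected: str, spoken: str, expires_at: float) -> bool:
    # A pre-rendered deepfake cannot know the phrase; an expired challenge is rejected outright.
    if time.time() >= expires_at:
        return False
    return secrets.compare_digest(expected.encode(), spoken.strip().lower().encode())
```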

Simeon Cholakov

Security Researcher

Niccolo Pozzolini

Smart Contract Audit Lead
