MIPT
Moscow Institute of
Physics and Technology
National Research University
Bachelor's Thesis Defense

Multimodal Deepfake Detection
in Video Conferencing

Dataset, architecture, and cross-dataset generalisation
Programme 01.03.02 · Applied Mathematics and Informatics · Phystech School of Applied Mathematics and Informatics
Author
Shukla Rituparn
Scientific supervisor
Makarov Ilya Andreevich
Defense
Moscow · 2026
01 · Motivation

The visual channel is no longer
proof of identity

Conferencing deepfakes are live, targeted, and interactive. They strike during synchronous calls on Zoom, Teams, and Google Meet, where the counterpart has seconds to judge authenticity and no recourse once deceived.

Wire fraud

Executives impersonated on business calls to authorize fraudulent transfers.

Fake hires

Attackers posing as job candidates through remote interviews and onboarding.

Bypassed KYC

Remote Know-Your-Customer identity checks defeated at financial institutions.

Social attacks

Impersonation of family members targeting elderly relatives.

The threat is no longer offline. Real-time avatar generation on consumer hardware turns any video call into an attack surface.
Shukla R. · MMGUNet defense
01 · Motivation

Conferencing video breaks the
assumptions of existing benchmarks

Standard datasets come from social media and broadcast footage. The conferencing domain differs along four axes at the same time.

Codec compression

H.264 / H.265 at low-latency bitrates, calibrated for transmission rather than visual fidelity.

Constrained pose

Head pose concentrated on the frontal axis. Users are seated, facing a webcam, with limited angular variation.

Low resolution

Source capped by webcam hardware, rarely above 1080p and often well below.

Virtual backgrounds

Segmentation artifacts at the person-background boundary, absent from every standard benchmark.

A detector that has never seen this combination of conditions performs at chance on conferencing video.
01 · The gap

No prior dataset covers all three
requirements at once

A

Modern methods

Diffusion transformers and real-time autoregressive avatars from 2024-2026, not pre-2020 GAN swaps.

B

Conferencing domain

Codec compression, frontal pose, webcam resolution, and virtual-background artifacts together.

C

Multimodal annotation

Per-clip physiological, geometric, appearance, and semantic cues for interpretable detection.

The closest prior work, Zoom-DF (2022), provided ~1,500 fakes from one first-generation method, and is no longer publicly available. It is inadequate against the 2025-2026 threat landscape.

01 · Objectives

Three research problems

1 · Dataset construction (Joint · team)

Build the first conferencing-domain deepfake dataset with a multimodal annotation framework spanning physiological, geometric, appearance, and semantic cues.

2 · Architecture design (My contribution)

Design and implement a multimodal architecture with joint classification and pixel-level localization, trainable on a single consumer-grade GPU.

3 · Generalization analysis (My contribution)

Quantify the contribution of conferencing-domain data to cross-dataset transfer through controlled ablation studies.
01 · Scope of this thesis

What is mine, and what is the team's

VCDF-X dataset

Joint · team

The construction pipeline, generation, and the eight-modality annotation framework are a collective contribution. My role: co-investigator on the detection side, responsible for the cross-dataset benchmarking that establishes the dataset's value.

MMGUNet + generalization study

My contribution

The four-modal architecture, the landmark-heatmap modality, the adaptive gating, the dual output head, the training pipeline, and the full cross-dataset ablation experiments are my own design and implementation.

Submitted to ACM Multimedia 2026, Dataset Track (Submission #335). VCDF-X: A Multimodal Explainable Video Conferencing Deepfake Dataset and Benchmark.
PART I · CONTEXT

VCDF-X
The dataset

A conferencing-domain deepfake benchmark with the richest per-sample annotation set in the detection literature.
Joint contribution · research team
02 · The VCDF-X dataset

A conferencing benchmark built
on the current threat landscape

13,768 total clips
9 generation pipelines (2024-2026)
8 annotation modalities per clip
4,151 / 9,617 real / fake clips
Fake-heavy by design: attackers generate freely, defenders work from a smaller authentic pool.
Dataset           Year   Real    Fake    Methods
FaceForensics++   2019   1,000   4,000   4
Celeb-DF          2020   590     5,639   1
DFDC              2020   23K     100K    8
ForgeryNet        2021   99K     121K    15
Zoom-DF           2022   50      1,500   1
DF40              2024           100K    40
VCDF-X (ours)     2026   4,151   9,617   9
Only VCDF-X combines modern methods + the conferencing domain + 8 annotation modalities.
02 · Generation methods

Nine modern pipelines, three families

Face swap

Inswapper+GFPGAN · FaceFusion · VisoMaster · DreamID-V · Hyperswap

Reenactment

SadTalker · LiveAvatar (14B)

Autoregressive streaming: the live real-time threat.

Generative

Ovi (11B) · LTX-2 (19B)

Text-to-video, no source footage needed.

Figure: representative synthetic frames from LTX-2, Ovi, and LiveAvatar. Every depicted person is fully generated; no real individual is shown.
All pipelines are run in their default configuration (non-expert attacker model), and the outputs are then post-processed with virtual-background replacement.
02 · Annotation framework

Eight modalities on every clip

Computed uniformly on real and fake samples to avoid label leakage. The richest per-sample annotation set in the literature.

Manipulation masks

Face-oval pseudo-masks from MediaPipe FaceMesh.

Facial action units

ME-GraphAU. AU25 (lips part) is elevated in fakes (0.724 vs 0.648).

Eye gaze

3D gaze vectors. Diagnostic for reenactment.

Depth maps

Depth Anything V2. Catches swaps that break 3D structure.

Face mesh

468-keypoint geometry for temporal consistency.

rPPG signal

PhysNet blood-volume pulse, absent in synthetics.

NL descriptions

Qwen2.5-VL-7B artifact text for explainability.

Demographics

Age, gender, ethnicity, emotion for fairness.

Masks and landmark geometry from this framework feed directly into the MMGUNet architecture I designed.
02 · Manipulation masks

Pseudo-masks from face-oval geometry

  • The 36 keypoints of the MediaPipe face oval form a polygon, rasterized into a binary mask per frame.
  • For face swaps and reenactments, the mask is a proxy for the manipulated region.
  • For fully generative content, the manipulated region is the full frame.
  • The same geometry is converted into a Gaussian landmark heatmap, the key input to MMGUNet.
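The rasterization step in the first bullet can be illustrated with a minimal sketch. It fills a closed polygon using the even-odd scanline rule in plain Python. This is an illustration only, not the thesis pipeline: the function name, the toy four-vertex outline, and the 8×8 frame are assumptions, and in practice a library routine such as OpenCV's `fillPoly` would normally be used on the 36 face-oval keypoints.

```python
def rasterize_polygon(vertices, height, width):
    """Return a height x width binary mask (lists of 0/1) for a closed polygon.

    vertices: (x, y) pixel coordinates listed in order around the outline.
    A pixel is set to 1 when its centre lies inside the polygon (even-odd rule).
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        y = row + 0.5  # sample each scanline at the pixel centre
        crossings = []  # x-coordinates where polygon edges cross this scanline
        for i in range(n):
            (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
            if (y0 <= y < y1) or (y1 <= y < y0):
                t = (y - y0) / (y1 - y0)
                crossings.append(x0 + t * (x1 - x0))
        crossings.sort()
        # Fill the pixels whose centres fall between successive crossing pairs.
        for left, right in zip(crossings[0::2], crossings[1::2]):
            for col in range(width):
                if left <= col + 0.5 < right:
                    mask[row][col] = 1
    return mask


# Toy stand-in for the face oval: a square outline on an 8x8 frame (hypothetical data).
outline = [(2, 2), (6, 2), (6, 6), (2, 6)]
mask = rasterize_polygon(outline, 8, 8)
area = sum(map(sum, mask))  # number of pixels inside the outline
```

For fully generative clips the same function is unnecessary, since the slide treats the whole frame as manipulated and the mask is all ones.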
Figure: two mask-overlay examples.
PART II · MY PRIMARY CONTRIBUTION

MMGUNet
The architecture

A four-modal gated U-Net for joint classification and pixel-level localization, trainable on a single consumer GPU.
My design & implementation
03 · Architecture overview

The MMGUNet architecture, end to end

Inputs → per-modality features (each [256, 7²])
  • Video frame [3, 224²] → ResNet18 encoder (ImageNet pretrained) → 1×1 conv → F_RGB
  • Grayscale → 2D FFT [1, 224²] → ModalBottleneck (1→32→…→256) → F_FFT
  • Grayscale → STFT [1, 224²] → ModalBottleneck (1→32→…→256) → F_STFT
  • FaceMesh → heatmap [1, 224²] → ModalBottleneck (1→32→…→256) → F_LM
Fusion bottleneck
  • Channel gate 1 + σ(·) applied to each of the four features
  • concat [1024, 7²] → 1×1 conv → F_fused [256, 7²]
U-Net decoder · localization
  • 5 UpBlocks · 256→128→64→64→32, reusing the ResNet skip connections x₀ … x₃ → mask M̂ [1, 224²]
Classification head
  • GAP [256] → Dropout p = 0.3 → Linear 256 → 1 → σ → p(fake) ∈ [0, 1]
Per-frame; 64 frames averaged into a video score. Trains in ~10-12 h on a single RTX A6000.
03 · Input modalities

Four complementary views of one frame

Raw RGB

Appearance pattern, ImageNet-normalized for the pretrained encoder.

\(\mathbf{I}_{\text{norm}} = \dfrac{\mathbf{I}-\boldsymbol{\mu}}{\boldsymbol{\sigma}}\)

FFT magnitude

Frequency artifacts of blending and compression; rings at high frequencies.

\(\mathbf{F}_{\text{FFT}} = \dfrac{\log(1+|\mathcal{F}_{2D}(g)|)-\mu_s}{\sigma_s+\varepsilon}\)

STFT spectrogram

Over a 1D-flattened frame; preserves where frequency anomalies occur.

\(n_{\text{fft}}=256,\ \text{hop}=128,\ \text{Hann} \to 224\times224\)
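A short arithmetic sketch shows why the STFT output needs the final `→ 224×224` step. It assumes the grayscale frame is flattened at 224×224 (50,176 samples) and uses the standard frame-count formula for a one-sided STFT. Whether the thesis code pads the signal (`center=True` in several libraries) is not stated on the slide, so both cases are shown. The function name is hypothetical.

```python
def stft_shape(num_samples, n_fft=256, hop=128, center=False):
    """Return (frequency_bins, time_frames) of a one-sided STFT.

    center=False: every frame must fit entirely inside the signal.
    center=True:  the signal is first padded by n_fft // 2 on both sides
                  (a common library convention; an assumption here).
    """
    if center:
        num_samples += 2 * (n_fft // 2)
    freq_bins = n_fft // 2 + 1
    time_frames = 1 + (num_samples - n_fft) // hop
    return freq_bins, time_frames


flattened = 224 * 224  # assumed: flattening happens at 224 x 224 resolution
shape_unpadded = stft_shape(flattened)               # (129, 391)
shape_centered = stft_shape(flattened, center=True)  # (129, 393)
```

Under either convention the raw spectrogram has 129 frequency bins and roughly 390 time frames, so it is presumably resized to reach the 224×224 input shape.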

Landmark heatmap · NOVEL

Gaussian blobs at 36 face-oval keypoints: an explicit face-region prior.

\(H(u,v)=\max_k \exp\!\left(-\dfrac{(u-x_k)^2+(v-y_k)^2}{2\sigma^2}\right)\)
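The heatmap formula above can be evaluated directly. The sketch below is a plain-Python illustration on a small grid. The value of σ, the two toy keypoints, and the grid size are assumptions (the slide gives none of them); the real input uses the 36 face-oval keypoints on a 224×224 frame.

```python
import math


def landmark_heatmap(keypoints, height, width, sigma=2.0):
    """H(u, v) = max_k exp(-((u - x_k)^2 + (v - y_k)^2) / (2 sigma^2)).

    keypoints: list of (x_k, y_k) pixel coordinates; u indexes columns, v rows.
    Taking the max over keypoints (not the sum) keeps every value in (0, 1].
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            heatmap[v][u] = max(
                math.exp(-((u - x) ** 2 + (v - y) ** 2) / (2 * sigma ** 2))
                for x, y in keypoints
            )
    return heatmap


# Two hypothetical keypoints on a 12-row by 16-column grid.
points = [(4, 4), (10, 6)]
H = landmark_heatmap(points, 12, 16)
```

Each keypoint location reaches exactly 1.0, and the response decays with distance from the nearest keypoint, which is what makes the map usable as a soft face-region prior.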
03 · Adaptive gating

A gate that amplifies, never zeroes

Channel gate · applied per modality
F [C,H,W] → GAP [C] → FC (W) → σ → 1 + (·) → gate [C,1,1]
\(\hat{\mathbf{F}}=\mathbf{F}\odot\bigl(1+\sigma(\mathbf{W}\,\mathrm{GAP}(\mathbf{F}))\bigr)\)

Standard sigmoid gate

Multiplier in \((0,1)\): the gate can only attenuate, and can drive a modality to zero before its weights stabilize.

Our \((1+\sigma)\) gate

Multiplier in \((1,2)\): the gate amplifies, with a neutral fallback of 1.0. No modality is ever zeroed out.

Four independent gates, no parameter sharing: each modality learns its own contribution profile per sample.
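The gate formula can be checked numerically with a minimal sketch. It uses nested Python lists in place of tensors so that it runs without dependencies; the function names and toy values are assumptions, and the fully connected layer is taken to be bias-free, as the formula is written.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_gate(feature, weight):
    """Apply F_hat = F * (1 + sigmoid(W @ GAP(F))) to one modality.

    feature: C channels, each an H x W grid (list of lists of floats).
    weight:  C x C matrix W of the fully connected layer (no bias, as in the formula).
    Returns (gated_feature, multipliers).
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    # Per-channel multiplier; sigmoid lies in (0, 1), so this lies between 1 and 2.
    multipliers = [
        1.0 + sigmoid(sum(w * p for w, p in zip(row, pooled))) for row in weight
    ]
    gated = [
        [[value * m for value in line] for line in ch]
        for ch, m in zip(feature, multipliers)
    ]
    return gated, multipliers


feature = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.0], [0.0, 1.0]]]  # C = 2, H = W = 2
zero_weight = [[0.0, 0.0], [0.0, 0.0]]
gated, multipliers = channel_gate(feature, zero_weight)  # multipliers are both 1.5
```

With all-zero weights every channel is scaled by 1.5; the multiplier approaches 1 only when the gate logit is strongly negative, and it cannot fall below 1, so no channel is suppressed.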
03 · Output heads & training

Dual head, three-term loss

Classification head
\(p(\text{fake})=\sigma\!\bigl(\mathbf{w}^{\top}\mathrm{Dropout}_{0.3}(\mathrm{AvgPool}(\mathbf{F}_{\text{fused}}))\bigr)\)
Video-level aggregation
\(s_{\text{video}}=\dfrac{1}{64}\sum_{i=1}^{64}p_i\)
  • U-Net decoder reuses ResNet skips \(x_0\dots x_3\) for a \(224\times224\) mask.
  • AdamW, lr \(10^{-4}\), warmup + cosine, 15 epochs, batch 32.
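The video-level aggregation above amounts to sampling 64 frames and averaging their scores. The sketch below shows one way to do this. The index formula is an assumption (the slides say only that frames are uniformly spaced), the function names are hypothetical, and the per-frame probabilities are stand-in values, not model outputs.

```python
def uniform_frame_indices(num_frames, num_samples=64):
    """Pick `num_samples` evenly spaced frame indices in [0, num_frames - 1]."""
    if num_frames <= 0:
        raise ValueError("clip has no frames")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def video_score(frame_probs):
    """s_video: arithmetic mean of the per-frame p(fake) values."""
    return sum(frame_probs) / len(frame_probs)


indices = uniform_frame_indices(300)  # a hypothetical 300-frame clip
probs = [0.9 if i % 2 == 0 else 0.7 for i in range(64)]  # stand-in scores
score = video_score(probs)
```

Clips shorter than 64 frames simply repeat indices under this scheme, which leaves the mean well defined.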
Total loss
\[\mathcal{L}=\mathcal{L}_{\text{cls}}+0.3\,\mathcal{L}_{\text{weak}}+0.5\,\mathcal{L}_{\text{sup}}\]
1.0

Classification

BCE with label smoothing 0.1. Carries the primary training signal.

0.3

Weak supervision

On the mean predicted mask. Low-information, so downweighted.

0.5

Strong supervision

Pixel-level BCE on approximate pseudo-masks, downweighted to protect accuracy.
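The weighting of the three terms can be written out as a small sketch. Two parts are assumptions: label smoothing is taken in its usual symmetric binary form (targets 0.05 and 0.95), and the weak term is passed in as a number because the slide defines it only as acting on the mean predicted mask. All function names are hypothetical.

```python
import math


def smooth_target(label, smoothing=0.1):
    """Symmetric binary label smoothing (assumed form): 0 -> 0.05, 1 -> 0.95."""
    return label * (1.0 - smoothing) + 0.5 * smoothing


def bce(p, target, eps=1e-7):
    """Binary cross-entropy of one probability p against a target in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def pixel_bce(pred_mask, pseudo_mask):
    """Strong term: BCE averaged over the pixels of a flattened mask."""
    return sum(bce(p, t) for p, t in zip(pred_mask, pseudo_mask)) / len(pred_mask)


def total_loss(l_cls, l_weak, l_sup, w_weak=0.3, w_sup=0.5):
    """L = L_cls + 0.3 * L_weak + 0.5 * L_sup."""
    return l_cls + w_weak * l_weak + w_sup * l_sup


# Toy values for one fake frame (all hypothetical).
l_cls = bce(0.8, smooth_target(1))
l_sup = pixel_bce([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])
l_weak = 0.4  # placeholder: the slide does not specify this term's exact form
loss = total_loss(l_cls, l_weak, l_sup)
```

Because the classification term has weight 1.0 and the two mask terms together contribute at most 0.8 of their raw values, the classification signal dominates the gradient whenever the three terms are of similar magnitude.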

PART III · MY PRIMARY CONTRIBUTION

Experiments
& results

Three controlled ablations, a baseline comparison, and the central cross-dataset finding.
My experiments & analysis
04 · Experimental setup

Three datasets, three domains

FaceForensics++

Broadcast. 127K frames, 4 first/second-gen methods.

Celeb-DF

Social media. 179K frames, high-quality second-gen swap.

VCDF-X

Conferencing. 13,768 clips, 9 third-gen methods.

Metrics
AUC (primary) · macro-F1 · EER · IoU (localization)
Baselines (same recipe)
ResNet-50 · MViT-V2-S · ViT-Base · Swin-V2-B · MDF (FAU+rPPG)
Identical recipe across conditions (optimizer, lr, schedule, seed 42). The only variable is the one each study isolates.
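For reference, three of the metrics listed above can be computed in a few lines of plain Python. This is a minimal sketch using textbook definitions, not the evaluation code used in the thesis; the function names and toy scores are assumptions. The AUC is written as a pairwise ranking probability, which makes clear why a value below 0.5 means the detector ranks fakes below reals more often than not.

```python
def auc(scores, labels):
    """ROC AUC as the probability that a random fake outranks a random real
    (ties count one half). labels: 1 = fake, 0 = real."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def eer(scores, labels):
    """Equal error rate: scan thresholds and return the mean of the false-positive
    and false-negative rates at the threshold where they are closest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= t and y == 0 for s, y in zip(scores, labels)) / n_neg
        fnr = sum(s < t and y == 1 for s, y in zip(scores, labels)) / n_pos
        if best is None or abs(fpr - fnr) < best[0]:
            best = (abs(fpr - fnr), (fpr + fnr) / 2)
    return best[1]


def iou(pred_mask, true_mask):
    """Intersection over union of two flattened binary masks."""
    inter = sum(p and t for p, t in zip(pred_mask, true_mask))
    union = sum(p or t for p, t in zip(pred_mask, true_mask))
    return inter / union if union else 1.0


scores = [0.1, 0.4, 0.35, 0.8]  # toy detector outputs
labels = [0, 0, 1, 1]
toy_auc = auc(scores, labels)                      # 0.75
flipped = auc([1 - s for s in scores], labels)     # 0.25: inverted ranking
```

Inverting the scores maps an AUC of a to 1 - a, so a below-chance detector still carries ranking information, only with the wrong sign.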
04 · Ablation 1 of 3 · backbone

ImageNet pretraining is the
single largest gain

Version         Key change                       AUC     IoU
v1              Random initialization            0.939   0.672
v2              Balanced sampler + method head   0.913   0.643
v3              LSTM over 8 frames               0.865   0.513
v4              1-frame train, 64-frame infer    0.942   0.672
v5 · ResNet18   ImageNet pretrain                0.986   0.759
v5 · EffNet     EfficientNet-B0 pretrained       0.979   0.756
v6 · ViT-B/16   Vision Transformer               0.973   0.717
SWIN-T          SWIN + multi-scale masks         0.946   0.175
+0.047
AUC from random init → ImageNet pretrain (0.939 → 0.986)
  • CNNs beat transformers at this data scale.
  • LSTM hurts: 8 frames too short to learn dynamics.
3-modality architecture (RGB + FFT + STFT), single-dataset training on VCDF-X.
04 · Ablation 2 of 3 · modality

The landmark heatmap drives
the largest single gain

Modalities          VCDF-X AUC   FF++ AUC   CelebDF AUC   IoU
RGB only            0.962        0.917      0.984         0.701
+ FFT               0.974        0.926      0.989         0.722
+ STFT              0.972        0.937      0.993         0.731
+ Landmark (full)   0.999        0.933      0.993         0.956
+0.225
localization IoU from adding the landmark heatmap (0.731 → 0.956)
+0.027
VCDF-X AUC, concentrated on the conferencing domain
The explicit face prior decouples the face signal from background appearance, exactly where conferencing video differs most.
04 · Ablation 3 of 3 · the central finding

Standard benchmarks alone score
below chance on conferencing

Training data         VCDF-X AUC   FF++ AUC   CelebDF AUC
FF++ + CelebDF only   0.454        0.948      0.995
VCDF-X only           0.999        0.510      0.570
All three combined    0.999        0.933      0.993
Full 4-modal architecture. The only variable is the training-set composition.
0.454
AUC · FF++ & CelebDF → VCDF-X
Worse than a random classifier on conferencing video
Any conferencing detector must be trained on conferencing-domain data. Combined training matches the best single-source AUC on VCDF-X (0.999) and stays within 0.015 of it on FF++ (0.933 vs 0.948) and within 0.002 on CelebDF (0.993 vs 0.995).
04 · Comparison with baselines

Matches or beats MDF on simpler inputs

Model               Train       Test      AUC     F1

In-domain (train & test on VCDF-X)
ResNet-50           VCDF-X      VCDF-X    0.974   0.975
Swin-V2-Base        VCDF-X      VCDF-X    0.997   0.975
MDF (FAU + rPPG)    VCDF-X      VCDF-X    0.980   0.907
MMGUNet (3-modal)   VCDF-X      VCDF-X    0.983   0.950

Combined training → per-domain test
MDF                 All three   FF++      0.935   0.808
MDF                 All three   CelebDF   0.998   0.981
MDF                 All three   VCDF-X    0.975   0.907
MMGUNet (4-modal)   All three   FF++      0.933   0.944
MMGUNet (4-modal)   All three   CelebDF   0.993   0.988
MMGUNet (4-modal)   All three   VCDF-X    0.999   0.993
05 · Discussion

Three findings, three mechanisms

1

Pretraining dominates

ImageNet features regularize the encoder against dataset-specific noise. Larger gain than any architectural change.

2

Geometry prior helps most

The landmark heatmap decouples the face signal from background, exactly where conferencing differs.

3

Combined training is required

No implicit transfer from broadcast GAN content to conferencing diffusion content.

For the field: single-benchmark AUC above 0.99 is an in-distribution upper bound, not a deployment estimate. Cross-domain evaluation should be a default reporting requirement.
06 · Conclusion

Contributions & what comes next

VCDF-X

Joint

First conferencing-domain benchmark, 8 modalities. Submitted to ACM MM 2026.

MMGUNet

Mine

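AUC 0.999 / 0.933 / 0.993 (VCDF-X / FF++ / CelebDF) under combined training. Within 0.005 AUC of MDF on FF++ and CelebDF, ahead on VCDF-X AUC and on F1 throughout, on simpler inputs.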

Generalization

Mine

First quantitative measure: training on FF++ and CelebDF alone scores below chance on conferencing video (AUC 0.454).

Future work
Integrate rPPG + FAU modalities · Attention-based temporal modeling · Real-time inference (quantization, distillation) · LLM-grounded explanations · Adversarial robustness
MIPT
ACM Multimedia 2026
Dataset Track · #335
Thank you

Questions are welcome

Architecture · experiments · dataset · limitations
Author
Shukla Rituparn
Supervisor
Makarov Ilya Andreevich
Contribution
MMGUNet + generalization analysis
APPENDIX · HELD FOR Q&A

Backup
slides

Per-method analysis, failure modes, threats to validity, limitations, and reproducibility.
Backup · per-method analysis

Mean score by source label

Source label                          Mean score   N
Real                                  0.288        164
LTX-2 (AIGC)                          0.786        157
Ovi (AIGC)                            ~            <30*
dlc (Inswapper+GFPGAN)                0.598        44
LiveAvatar (14B)                      0.568        92
FaceFusion / VisoMaster / DreamID-V   ~            <30*
  • Real well-separated from every fake label.
  • AIGC easiest: no natural acquisition noise to mimic.
  • Hardest: LiveAvatar (no blend boundary) and dlc (GFPGAN smooths the FFT signature).
  • * <30 videos: reported for completeness, not statistically interpretable.
Mean video-level score on a 500-clip subsample. Mean scores are not directly comparable to dataset-level AUC.
Backup · failure modes

Where the detector errs

Unusual lighting → fake

Low-light and backlit scenes: natural shadow boundaries read as blending artifacts.

Short LiveAvatar → real

Clips under 12s lack the temporal signature the model relies on; autoregressive output is incoherent early.

Heavy makeup → fake

Extreme appearance modifications and occlusion were unseen in training and look like artifacts.

These motivate the temporal-modeling and adversarial-robustness directions in future work.
Backup · threats to validity

Split granularity & in-domain leakage

The caveat

Clip-level random split does not guarantee source disjointness: long recordings were segmented into sibling clips, and swaps share a synthetic identity pool. In-domain AUC ≈ 0.999 is an optimistic upper bound.

Why the finding holds

The generalization claims are cross-dataset. Transfer failures are measured on FF++ and CelebDF, which share no source, identities, or pipelines with VCDF-X. Those numbers are independent of the VCDF-X split.

A source-level / identity-level partition of VCDF-X would tighten the in-domain figures. Left for future work.
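A source-level partition of the kind proposed above could be sketched as follows. Hashing the source identifier assigns every sibling clip of one recording to the same partition. The clip and source identifiers, the function names, and the 20% test fraction are all hypothetical; for swaps that share a synthetic identity pool, the grouping key would also need to cover the identity.

```python
import hashlib


def assign_partition(source_id, test_fraction=0.2):
    """Map a source recording (or identity) to 'train' or 'test' deterministically.

    The decision depends only on the source id, so all clips cut from the same
    recording land in the same partition.
    """
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def source_level_split(clips):
    """clips: list of (clip_id, source_id). Returns {'train': [...], 'test': [...]}."""
    split = {"train": [], "test": []}
    for clip_id, source_id in clips:
        split[assign_partition(source_id)].append(clip_id)
    return split


# Hypothetical data: 50 recordings, each segmented into 4 sibling clips.
clips = [(f"rec{r:03d}_seg{s}", f"rec{r:03d}") for r in range(50) for s in range(4)]
split = source_level_split(clips)
```

Because the assignment is a pure function of the source id, the split is reproducible without storing a random seed, and clips added later from an existing recording stay in that recording's partition.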
Backup · limitations

Stated plainly

  • Simulated transmission: offline segmentation and post-hoc compression, not real Zoom/Teams pipelines.
  • Demographic imbalance: reflects source YouTube distribution; some subgroups underrepresented.
  • STFT semantics: non-standard on a flattened image; a 2D wavelet basis would be more principled.
  • No temporal modeling: mean aggregation discards temporal statistics; LSTM hurt.
  • Single-GPU scale: results are a single-GPU baseline, not a scale-frontier result.
  • Real-time gap: ~60-80 FPS on the training GPU; deployment needs quantization or a smaller backbone.
Backup · reproducibility

Compute & protocol

RTX A6000, 48GB, Slurm (ITMO AI Talent Hub)
~11h per configuration, single GPU, batch 32
64 uniformly-spaced frames at inference
15 epochs, AdamW, warmup + cosine
Dataset access is restricted to reviewers under the owner's policy. MMGUNet code and weights will be released with the camera-ready paper.