Moscow Institute of Physics and Technology (National Research University)
Bachelor's Thesis Defense
Multimodal Deepfake Detection in Video Conferencing
Dataset, architecture, and cross-dataset generalisation
Programme 01.03.02 · Applied Mathematics and Informatics · Phystech School of Applied Mathematics and Informatics
Author
Shukla Rituparn
Scientific supervisor
Makarov Ilya Andreevich
Defense
Moscow · 2026
01 · Motivation
The visual channel is no longer proof of identity
Conferencing deepfakes are live, targeted, and interactive. They strike during synchronous calls on Zoom, Teams, and Google Meet, where the counterpart has seconds to judge authenticity and no recourse once deceived.
Wire fraud
Executives impersonated on business calls to authorize fraudulent transfers.
Fake hires
Attackers posing as job candidates through remote interviews and onboarding.
Bypassed KYC
Remote Know-Your-Customer identity checks defeated at financial institutions.
Social attacks
Impersonation of family members targeting elderly relatives.
The threat is no longer offline. Real-time avatar generation on consumer hardware turns any video call into an attack surface.
Shukla R. · MMGUNet defense
01 · Motivation
Conferencing video breaks the assumptions of existing benchmarks
Standard datasets come from social media and broadcast footage. The conferencing domain differs along four axes at the same time.
Codec compression
H.264 / H.265 at low-latency bitrates, calibrated for transmission rather than visual fidelity.
Constrained pose
Head pose concentrated on the frontal axis. Users are seated, facing a webcam, with limited angular variation.
Low resolution
Source capped by webcam hardware, rarely above 1080p and often well below.
Virtual backgrounds
Segmentation artifacts at the person-background boundary, absent from every standard benchmark.
A detector that has never seen this combination of conditions performs no better than chance on conferencing video.
01 · The gap
No prior dataset covers all three requirements at once
A
Modern methods
Diffusion transformers and real-time autoregressive avatars from 2024-2026, not pre-2020 GAN swaps.
B
Conferencing domain
Codec compression, frontal pose, webcam resolution, and virtual-background artifacts together.
C
Multimodal annotation
Per-clip physiological, geometric, appearance, and semantic cues for interpretable detection.
The closest prior work, Zoom-DF (2022), provided ~1,500 fakes from one first-generation method, and is no longer publicly available. It is inadequate against the 2025-2026 threat landscape.
01 · Objectives
Three research problems
1
Dataset construction
Build the first conferencing-domain deepfake dataset with a multimodal annotation framework spanning physiological, geometric, appearance, and semantic cues.
Joint · team
2
Architecture design
Design and implement a multimodal architecture with joint classification and pixel-level localization, trainable on a single consumer-grade GPU.
My contribution
3
Generalization analysis
Quantify the contribution of conferencing-domain data to cross-dataset transfer through controlled ablation studies.
My contribution
01 · Scope of this thesis
What is mine, and what is the team's
VCDF-X dataset
Joint · team
The construction pipeline, generation, and the eight-modality annotation framework are a collective contribution. My role: co-investigator on the detection side, responsible for the cross-dataset benchmarking that establishes the dataset's value.
MMGUNet + generalization study
My contribution
The four-modal architecture, the landmark-heatmap modality, the adaptive gating, the dual output head, the training pipeline, and the full cross-dataset ablation experiments are my own design and implementation.
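The slides name adaptive gating but do not spell out its form. As a rough illustration only, a common way to gate several modality branches is a softmax-weighted sum of their feature vectors. The function names, the dictionary layout, and the idea that the gate logits are supplied directly (rather than predicted per sample by a small network) are all assumptions of this sketch, not the MMGUNet implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_logits):
    """Fuse per-modality feature vectors with softmax-normalized gates.

    features:    {modality_name: [float, ...]}, all vectors the same length.
    gate_logits: {modality_name: float}. Hypothetical input: a trained model
                 would typically predict these per sample.
    Returns (fused_vector, {modality_name: gate_weight}).
    """
    names = sorted(features)
    weights = softmax([gate_logits[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += w * value
    return fused, dict(zip(names, weights))


# Toy example with the four modality names used in the deck.
toy_features = {
    "rgb": [1.0, 0.0],
    "fft": [0.0, 1.0],
    "stft": [0.0, 0.0],
    "landmark": [1.0, 1.0],
}
toy_logits = {"rgb": 0.0, "fft": 0.0, "stft": 0.0, "landmark": 0.0}
fused, gates = gated_fusion(toy_features, toy_logits)
```

With equal logits every modality receives weight 0.25; raising one logit shifts the fused vector toward that modality, which is the behaviour a gate is meant to provide.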
Submitted to ACM Multimedia 2026, Dataset Track (Submission #335): "VCDF-X: A Multimodal Explainable Video Conferencing Deepfake Dataset and Benchmark".
PART I · CONTEXT
VCDF-X: The dataset
A conferencing-domain deepfake benchmark with the richest per-sample annotation set in the detection literature.
Joint contribution · research team
02 · The VCDF-X dataset
A conferencing benchmark built on the current threat landscape
13,768
total clips
9
generation pipelines (2024-2026)
8
annotation modalities per clip
4,151 / 9,617
real / fake clips
Fake-heavy by design: attackers generate freely, defenders work from a smaller authentic pool.
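| Dataset | Year | Real | Fake | Methods |
|---|---|---|---|---|
| FaceForensics++ | 2019 | 1,000 | 4,000 | 4 |
| Celeb-DF | 2020 | 590 | 5,639 | 1 |
| DFDC | 2020 | 23K | 100K | 8 |
| ForgeryNet | 2021 | 99K | 121K | 15 |
| Zoom-DF | 2022 | 50 | 1,500 | 1 |
| DF40 | 2024 | – | 100K | 40 |
| VCDF-X (ours) | 2026 | 4,151 | 9,617 | 9 |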
Only VCDF-X combines modern methods + the conferencing domain + 8 annotation modalities.
Social media. 179K frames, high-quality second-gen swap.
VCDF-X
Conferencing. 13,768 clips, 9 third-gen methods.
Metrics
AUC (primary) · macro-F1 · EER · IoU (localization)
Baselines (same recipe)
ResNet-50 · MViT-V2-S · ViT-Base · Swin-V2-B · MDF (FAU + rPPG)
Identical recipe across conditions (optimizer, lr, schedule, seed 42). The only variable is the one each study isolates.
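For readers less familiar with the four metrics listed above, the following is a minimal pure-Python sketch of how each can be computed. It assumes binary labels (1 = fake, 0 = real), per-clip scores where higher means more likely fake, and binary masks given as nested lists. All function names are illustrative, and the thesis's own evaluation code may differ in details such as threshold selection.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a fake outranks a real (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def equal_error_rate(labels, scores):
    """Sweep thresholds; return the error rate where FPR and FNR are closest."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = None
    for t in sorted(set(scores)):
        fnr = sum(1 for p in pos if p < t) / len(pos)
        fpr = sum(1 for n in neg if n >= t) / len(neg)
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, (fpr + fnr) / 2)
    return best[1]


def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    f1s = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def mask_iou(pred_mask, true_mask):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0
```

AUC and EER are threshold-free summaries of the score distribution, whereas macro-F1 depends on the decision threshold used to produce the predictions, so the two families can rank models differently.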
04 · Ablation 1 of 3 · backbone
ImageNet pretraining is the single largest gain
| Version | Key change | AUC | IoU |
|---|---|---|---|
| v1 | Random initialization | 0.939 | 0.672 |
| v2 | Balanced sampler + method head | 0.913 | 0.643 |
| v3 | LSTM over 8 frames | 0.865 | 0.513 |
| v4 | 1-frame train, 64-frame infer | 0.942 | 0.672 |
| v5 · ResNet18 | ImageNet pretrain | 0.986 | 0.759 |
| v5 · EffNet | EfficientNet-B0 pretrained | 0.979 | 0.756 |
| v6 · ViT-B/16 | Vision Transformer | 0.973 | 0.717 |
| SWIN-T | SWIN + multi-scale masks | 0.946 | 0.175 |
+0.047
AUC from random init → ImageNet pretrain (0.939 → 0.986)
CNNs beat transformers at this data scale.
LSTM hurts: 8 frames too short to learn dynamics.
3-modality architecture (RGB + FFT + STFT), single-dataset training on VCDF-X.
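To make the FFT modality concrete, here is a hedged sketch of a log-magnitude spectrum computed with a naive 2D DFT on a small grayscale patch. The deck does not describe the exact frequency preprocessing (windowing, shifting, normalization), so the function name and the `log(1 + |F|)` scaling are assumptions. A real pipeline would use a fast FFT routine rather than this quadratic loop.

```python
import cmath
import math


def dft2_log_magnitude(patch):
    """Naive 2D DFT of a small grayscale patch; returns log(1 + |F(u, v)|)."""
    h, w = len(patch), len(patch[0])
    out = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    acc += patch[y][x] * cmath.exp(1j * angle)
            out[u][v] = math.log1p(abs(acc))
    return out


# A flat patch has all of its energy in the DC bin (0, 0);
# a checkerboard also has energy at the highest frequency (h/2, w/2).
flat = [[1.0] * 4 for _ in range(4)]
checker = [[float((x + y) % 2) for x in range(4)] for y in range(4)]
flat_spec = dft2_log_magnitude(flat)
checker_spec = dft2_log_magnitude(checker)
```

The checkerboard case illustrates why a frequency view can help: periodic high-frequency structure, such as upsampling residue, shows up as isolated peaks away from the DC bin.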
04 · Ablation 2 of 3 · modality
The landmark heatmap drives the largest single gain
| Modalities | VCDF-X AUC | FF++ AUC | CelebDF AUC | IoU |
|---|---|---|---|---|
| RGB only | 0.962 | 0.917 | 0.984 | 0.701 |
| + FFT | 0.974 | 0.926 | 0.989 | 0.722 |
| + STFT | 0.972 | 0.937 | 0.993 | 0.731 |
| + Landmark (full) | 0.999 | 0.933 | 0.993 | 0.956 |
+0.225
localization IoU from adding the landmark heatmap (0.731 → 0.956)
+0.027
VCDF-X AUC, concentrated on the conferencing domain
The explicit face prior decouples the face signal from background appearance, exactly where conferencing video differs most.
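One plausible way to turn detected facial landmarks into an image-shaped face prior is to render each landmark as a Gaussian bump on a single-channel map. The sketch below assumes landmarks arrive as `(x, y)` pixel coordinates from some external detector. The Gaussian form, the `sigma` value, and the max-combination rule are illustrative choices, not details confirmed by the deck.

```python
import math


def landmark_heatmap(landmarks, height, width, sigma=2.0):
    """Render (x, y) landmarks as one map of Gaussian bumps with values in [0, 1]."""
    heat = [[0.0] * width for _ in range(height)]
    for lx, ly in landmarks:
        for y in range(height):
            for x in range(width):
                d2 = (x - lx) ** 2 + (y - ly) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Max-combine so overlapping landmarks never exceed 1.
                if value > heat[y][x]:
                    heat[y][x] = value
    return heat


# Two hypothetical landmarks on a tiny 8x8 grid.
heat = landmark_heatmap([(2, 3), (5, 3)], height=8, width=8, sigma=1.0)
```

Because the map depends only on landmark positions, it carries no background appearance at all, which matches the decoupling argument made above.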
04 · Ablation 3 of 3 · the central finding
Standard benchmarks alone score below chance on conferencing
| Training data | VCDF-X AUC | FF++ AUC | CelebDF AUC |
|---|---|---|---|
| FF++ + CelebDF only | 0.454 | 0.948 | 0.995 |
| VCDF-X only | 0.999 | 0.510 | 0.570 |
| All three combined | 0.999 | 0.933 | 0.993 |
Full 4-modal architecture. The only variable is the training-set composition.
0.454
AUC · FF++ & CelebDF → VCDF-X
Worse than a random classifier on conferencing video
Any conferencing detector must be trained on conferencing-domain data. Combined training restores VCDF-X AUC to 0.999 while staying within 0.015 AUC of the benchmark-only model on FF++ (0.933 vs 0.948) and within 0.002 on CelebDF (0.993 vs 0.995).
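The size of the trade-off can be read directly from the reported AUCs. The short sketch below copies the nine values from the training-data ablation and computes, for each test set, how far combined training falls short of the best specialised configuration. The dictionary layout and function names are illustrative.

```python
# AUC values copied from the training-data ablation above.
AUC = {
    "FF++ + CelebDF only": {"VCDF-X": 0.454, "FF++": 0.948, "CelebDF": 0.995},
    "VCDF-X only": {"VCDF-X": 0.999, "FF++": 0.510, "CelebDF": 0.570},
    "All three combined": {"VCDF-X": 0.999, "FF++": 0.933, "CelebDF": 0.993},
}
COMBINED = "All three combined"


def best_specialist(test_set):
    """Highest AUC on test_set among the non-combined training configurations."""
    return max(row[test_set] for name, row in AUC.items() if name != COMBINED)


def combined_gap(test_set):
    """AUC lost by combined training relative to the best specialist."""
    return round(best_specialist(test_set) - AUC[COMBINED][test_set], 3)


gaps = {name: combined_gap(name) for name in ("VCDF-X", "FF++", "CelebDF")}
```

With the reported values the differences are 0.000 on VCDF-X, 0.015 on FF++, and 0.002 on CelebDF.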
04 · Comparison with baselines
Matches or beats MDF on simpler inputs
In-domain (train & test on VCDF-X)

| Model | Train | Test | AUC | F1 |
|---|---|---|---|---|
| ResNet-50 | VCDF-X | VCDF-X | 0.974 | 0.975 |
| Swin-V2-Base | VCDF-X | VCDF-X | 0.997 | 0.975 |
| MDF (FAU + rPPG) | VCDF-X | VCDF-X | 0.980 | 0.907 |
| MMGUNet (3-modal) | VCDF-X | VCDF-X | 0.983 | 0.950 |

Combined training → per-domain test

| Model | Train | Test | AUC | F1 |
|---|---|---|---|---|
| MDF | All three | FF++ | 0.935 | 0.808 |
| MDF | All three | CelebDF | 0.998 | 0.981 |
| MDF | All three | VCDF-X | 0.975 | 0.907 |
| MMGUNet (4-modal) | All three | FF++ | 0.933 | 0.944 |
| MMGUNet (4-modal) | All three | CelebDF | 0.993 | 0.988 |
| MMGUNet (4-modal) | All three | VCDF-X | 0.999 | 0.993 |
05 · Discussion
Three findings, three mechanisms
1
Pretraining dominates
ImageNet features regularize the encoder against dataset-specific noise. The +0.047 AUC gain exceeds that of any other change in the backbone ablation.
2
Geometry prior helps most
The landmark heatmap decouples the face signal from background, exactly where conferencing differs.
3
Combined training is required
No implicit transfer from broadcast GAN content to conferencing diffusion content.
For the field: single-benchmark AUC above 0.99 is an in-distribution upper bound, not a deployment estimate. Cross-domain evaluation should be a default reporting requirement.
06 · Conclusion
Contributions & what comes next
VCDF-X
Joint
First conferencing-domain benchmark, 8 modalities. Submitted to ACM MM 2026.
MMGUNet
Mine
AUC 0.999 / 0.933 / 0.993 under combined training. Higher F1 than MDF on all three test sets, on simpler inputs.
Generalization
Mine
First quantitative measure: training on standard benchmarks alone scores below chance (AUC 0.454) on conferencing video.
Per-method analysis, failure modes, threats to validity, limitations, and reproducibility.
Backup · per-method analysis
Mean score by source label
| Source label | Mean score | N |
|---|---|---|
| Real | 0.288 | 164 |
| LTX-2 (AIGC) | 0.786 | 157 |
| Ovi (AIGC) | ~ | <30* |
| dlc (Inswapper+GFPGAN) | 0.598 | 44 |
| LiveAvatar (14B) | 0.568 | 92 |
| FaceFusion / VisoMaster / DreamID-V | ~ | <30* |
Real well-separated from every fake label.
AIGC easiest: no natural acquisition noise to mimic.
Hardest: LiveAvatar (no blend boundary) and dlc (GFPGAN smooths the FFT signature).
* <30 videos: reported for completeness, not statistically interpretable.
Mean video-level score on a 500-clip subsample. Mean scores are not directly comparable to dataset-level AUC.
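The per-label summary above amounts to grouping video-level scores by source label and suppressing the mean for groups that are too small. A minimal sketch follows. The 30-video threshold comes from the footnote, while the record format and function name are assumptions.

```python
from collections import defaultdict

MIN_N = 30  # groups below this size are reported without a mean


def mean_score_by_label(records, min_n=MIN_N):
    """Group (source_label, score) pairs and return {label: (mean or None, n)}."""
    buckets = defaultdict(list)
    for label, score in records:
        buckets[label].append(score)
    summary = {}
    for label, scores in buckets.items():
        n = len(scores)
        mean = sum(scores) / n if n >= min_n else None
        summary[label] = (mean, n)
    return summary


# Synthetic records for illustration only; these are not thesis data.
demo = [("label_a", 0.2)] * 30 + [("label_b", 0.9)] * 5
demo_summary = mean_score_by_label(demo)
```

Returning `None` rather than a number for small groups keeps an unreliable mean from being quoted by accident.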
Backup · failure modes
Where the detector errs
Unusual lighting → fake
Low-light and backlit scenes: natural shadow boundaries read as blending artifacts.
Short LiveAvatar → real
Clips under 12s lack the temporal signature the model relies on; autoregressive output is incoherent early.
Heavy makeup → fake
Extreme appearance modifications and occlusion were unseen in training and look like artifacts.
These motivate the temporal-modeling and adversarial-robustness directions in future work.
Backup · threats to validity
Split granularity & in-domain leakage
The caveat
Clip-level random split does not guarantee source disjointness: long recordings were segmented into sibling clips, and swaps share a synthetic identity pool. In-domain AUC ≈ 0.999 is an optimistic upper bound.
Why the finding holds
The generalization claims are cross-dataset. Transfer failures are measured between VCDF-X and FF++ / CelebDF, which share no sources, identities, or pipelines with it. Those numbers are unaffected by leakage within the VCDF-X split.
A source-level / identity-level partition of VCDF-X would tighten the in-domain figures. Left for future work.
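A source-level partition of this kind can be sketched as a grouped split: every clip carries the identifier of the recording it was cut from, and whole recordings are assigned to one side. The `(clip_id, source_id)` record format, the 20% test fraction, and the function name are assumptions of the sketch. Seed 42 mirrors the seed quoted for the experiments.

```python
import random
from collections import defaultdict


def source_level_split(clips, test_fraction=0.2, seed=42):
    """Split (clip_id, source_id) pairs so no source appears on both sides."""
    by_source = defaultdict(list)
    for clip_id, source_id in clips:
        by_source[source_id].append(clip_id)
    sources = sorted(by_source)  # fixed order before the seeded shuffle
    random.Random(seed).shuffle(sources)
    n_test = max(1, round(len(sources) * test_fraction))
    test_sources = set(sources[:n_test])
    train = [c for s in sources if s not in test_sources for c in by_source[s]]
    test = [c for s in sources if s in test_sources for c in by_source[s]]
    return train, test


# Ten hypothetical recordings, each segmented into three sibling clips.
demo_clips = [(f"rec{r}_clip{k}", f"rec{r}") for r in range(10) for k in range(3)]
train_ids, test_ids = source_level_split(demo_clips)
```

The same function gives an identity-level split if the identity label is passed in place of the recording identifier.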
Backup · limitations
Stated plainly
Simulated transmission: offline segmentation and post-hoc compression, not real Zoom/Teams pipelines.
Demographic imbalance: reflects source YouTube distribution; some subgroups underrepresented.
STFT semantics: non-standard on a flattened image; a 2D wavelet basis would be more principled.
No temporal modeling: mean aggregation discards temporal statistics; LSTM hurt.
Single-GPU scale: results are a single-GPU baseline, not a scale-frontier result.
Real-time gap: ~60-80 FPS on the training GPU; deployment needs quantization or a smaller backbone.
Backup · reproducibility
Compute & protocol
6×
RTX A6000, 48GB, Slurm (ITMO AI Talent Hub)
~11h
per configuration, single GPU, batch 32
64
uniformly-spaced frames at inference
15
epochs, AdamW, warmup + cosine
Dataset access is restricted to reviewers under the owner's policy. MMGUNet code and weights will be released with the camera-ready paper.
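The inference protocol (64 uniformly spaced frames, then mean aggregation of per-frame outputs) can be sketched in a few lines. The function names are illustrative, and the rounding rule for picking indices is an assumption, since the deck does not state how fractional positions are resolved.

```python
def uniform_frame_indices(num_frames, num_samples=64):
    """Indices of num_samples frames spread evenly over a clip of num_frames."""
    if num_frames <= 0:
        raise ValueError("clip has no frames")
    if num_samples == 1:
        return [num_frames // 2]
    step = (num_frames - 1) / (num_samples - 1)
    # Clips shorter than num_samples simply repeat some indices.
    return [round(i * step) for i in range(num_samples)]


def video_score(frame_scores):
    """Mean aggregation of per-frame fake probabilities into one video score."""
    return sum(frame_scores) / len(frame_scores)


indices = uniform_frame_indices(640, 64)
```

Mean aggregation is order-invariant, which is why it discards the temporal statistics mentioned among the limitations.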