Moscow Institute of
Physics and Technology
National Research University

Bachelor's Thesis Defense

Multimodal Deepfake Detection
in Video Conferencing

Dataset, architecture, and cross-dataset generalisation

Programme 01.03.02 · Applied Mathematics and Informatics · Phystech School of Applied Mathematics and Informatics

Author

Shukla Rituparn

Scientific supervisor

Makarov Ilya Andreevich

Defense

Moscow · 2026

01 · Motivation

The visual channel is no longer
proof of identity

Conferencing deepfakes are live, targeted, and interactive. They strike during synchronous calls on Zoom, Teams, and Google Meet, where the counterpart has seconds to judge authenticity and no recourse once deceived.

Wire fraud

Executives impersonated on business calls to authorize fraudulent transfers.

Fake hires

Attackers posing as job candidates through remote interviews and onboarding.

Bypassed KYC

Remote Know-Your-Customer identity checks defeated at financial institutions.

Social attacks

Impersonation of family members targeting elderly relatives.

The threat is no longer offline. Real-time avatar generation on consumer hardware turns any video call into an attack surface.

Shukla R. · MMGUNet defense

01 · Motivation

Conferencing video breaks the
assumptions of existing benchmarks

Standard datasets come from social media and broadcast footage. The conferencing domain differs along four axes at the same time.

Codec compression

H.264 / H.265 at low-latency bitrates, calibrated for transmission rather than visual fidelity.

Constrained pose

Head pose concentrated on the frontal axis. Users are seated, facing a webcam, with limited angular variation.

Low resolution

Source capped by webcam hardware, rarely above 1080p and often well below.

Virtual backgrounds

Segmentation artifacts at the person-background boundary, absent from every standard benchmark.

A detector that has never seen this combination of conditions performs at or below chance on conferencing video (AUC 0.454 in the cross-dataset ablation).

01 · The gap

No prior dataset covers all three
requirements at once

Modern methods

Diffusion transformers and real-time autoregressive avatars from 2024-2026, not pre-2020 GAN swaps.

Conferencing domain

Codec compression, frontal pose, webcam resolution, and virtual-background artifacts together.

Multimodal annotation

Per-clip physiological, geometric, appearance, and semantic cues for interpretable detection.

The closest prior work, Zoom-DF (2022), provided ~1,500 fakes from one first-generation method, and is no longer publicly available. It is inadequate against the 2025-2026 threat landscape.

01 · Objectives

Three research problems

Dataset construction

Build the first conferencing-domain deepfake dataset that combines modern generation methods with a multimodal annotation framework spanning physiological, geometric, appearance, and semantic cues.

Joint · team

Architecture design

Design and implement a multimodal architecture with joint classification and pixel-level localization, trainable on a single consumer-grade GPU.

My contribution

Generalization analysis

Quantify the contribution of conferencing-domain data to cross-dataset transfer through controlled ablation studies.

My contribution

01 · Scope of this thesis

What is mine, and what is the team's

VCDF-X dataset

Joint · team

Construction pipeline, generation, and the eight-modality annotation framework are a collective contribution. My role: co-investigator on the detection side, and the cross-dataset benchmarking that establishes the dataset's value.

MMGUNet + generalization study

My contribution

The four-modal architecture, the landmark-heatmap modality, the adaptive gating, the dual output head, the training pipeline, and the full cross-dataset ablation experiments are my own design and implementation.

Submitted to ACM Multimedia 2026, Dataset Track (Submission #335). VCDF-X: A Multimodal Explainable Video Conferencing Deepfake Dataset and Benchmark.

02 · The VCDF-X dataset

A conferencing benchmark built
on the current threat landscape

13,768

total clips

9

generation pipelines (2024-2026)

8

annotation modalities per clip

4,151 / 9,617

real / fake clips

Fake-heavy by design: attackers generate freely, defenders work from a smaller authentic pool.

Dataset	Year	Real	Fake	Methods
FaceForensics++	2019	1,000	4,000	4
Celeb-DF	2020	590	5,639	1
DFDC	2020	23K	100K	8
ForgeryNet	2021	99K	121K	15
Zoom-DF	2022	50	1,500	1
DF40	2024	–	100K	40
VCDF-X (ours)	2026	4,151	9,617	9

Only VCDF-X combines modern methods + the conferencing domain + 8 annotation modalities.

02 · Generation methods

Nine modern pipelines, three families

Face swap

Inswapper+GFPGAN · FaceFusion · VisoMaster · DreamID-V · Hyperswap

Reenactment

SadTalker · LiveAvatar (14B)

Autoregressive streaming: the live real-time threat.

Generative

Ovi (11B) · LTX-2 (19B)

Text-to-video, no source footage needed.

Panels: LTX-2 · Ovi · LiveAvatar

Representative synthetic frames. Every depicted person is fully generated; no real individual is shown.

All pipelines run in default configuration (non-expert attacker model), then post-processed with virtual-background replacement.

02 · Annotation framework

Eight modalities on every clip

Computed uniformly on real and fake samples to avoid label leakage. To our knowledge, the richest per-sample annotation set in the literature.

Manipulation masks

Face-oval pseudo-masks from MediaPipe FaceMesh.

Facial action units

ME-GraphAU. AU25 (lips part) is elevated in fakes (0.724 vs 0.648).

Eye gaze

3D gaze vectors. Diagnostic for reenactment.

Depth maps

Depth Anything V2. Catches swaps that break 3D structure.

Face mesh

468-keypoint geometry for temporal consistency.

rPPG signal

PhysNet blood-volume pulse, absent in synthetics.

NL descriptions

Qwen2.5-VL-7B artifact text for explainability.

Demographics

Age, gender, ethnicity, emotion for fairness.

Masks and landmark geometry from this framework feed directly into the MMGUNet architecture I designed.

02 · Manipulation masks

Pseudo-masks from face-oval geometry

The 36 keypoints of the MediaPipe face oval form a polygon, rasterized into a binary mask per frame.
For face swaps and reenactments, the mask is a proxy for the manipulated region.
For fully generative content, the manipulated region is the full frame.
The same geometry is converted into a Gaussian landmark heatmap, the key input to MMGUNet.
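
A minimal sketch of the rasterization step described above, in plain Python. It assumes the 36 face-oval keypoints are already available as ordered (x, y) pixel coordinates. The names `rasterize_polygon` and `clip_mask`, the even-odd test at pixel centres, and the `fully_generated` flag are illustrative assumptions, not the thesis pipeline.

```python
def rasterize_polygon(vertices, height, width):
    """Return a height x width binary mask (nested lists of 0/1).

    vertices: ordered (x, y) pixel coordinates of a closed polygon.
    A pixel is set when its centre lies inside the polygon (even-odd rule).
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        py = row + 0.5
        for col in range(width):
            px = col + 0.5
            inside = False
            for i in range(n):
                x1, y1 = vertices[i]
                x2, y2 = vertices[(i + 1) % n]
                # Count edges that cross the horizontal ray going right from the pixel centre.
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[row][col] = 1 if inside else 0
    return mask


def clip_mask(vertices, height, width, fully_generated=False):
    """Face-oval proxy mask for swaps and reenactments, full frame for generative clips."""
    if fully_generated:
        return [[1] * width for _ in range(height)]
    return rasterize_polygon(vertices, height, width)
```

For a square with corners (2, 2) and (6, 6) on an 8×8 grid, the pixels whose centres fall strictly inside the square are set; a fully generated clip returns an all-ones mask.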

mask overlay

03 · Architecture overview

The MMGUNet architecture, end to end

Diagram: ResNet18 encoder · ModalBottleneck · channel gate (1+σ) · fusion (1×1 conv) · U-Net decoder · classification head. Legend: tensor flow, skip connection.

Per-frame; 64 frames averaged into a video score. Trains in ~10-12 h on a single RTX A6000.

03 · Input modalities

Four complementary views of one frame

Raw RGB

Appearance pattern, ImageNet-normalized for the pretrained encoder.

\(\mathbf{I}_{\text{norm}} = \dfrac{\mathbf{I}-\boldsymbol{\mu}}{\boldsymbol{\sigma}}\)

FFT magnitude

Frequency artifacts of blending and compression; rings at high frequencies.

\(\mathbf{F}_{\text{FFT}} = \dfrac{\log(1+|\mathcal{F}_{2D}(g)|)-\mu_s}{\sigma_s+\varepsilon}\)
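
A small sketch of the FFT-magnitude formula above, under stated assumptions: \(g\) is taken to be a grayscale frame, \(\mu_s\) and \(\sigma_s\) are taken to be the per-sample mean and standard deviation of the log-magnitude, and the transform is computed by direct summation rather than a fast algorithm, so it is only practical for tiny inputs. The function names are hypothetical.

```python
import cmath
import math


def dft2_magnitude(gray):
    """|F_2D(g)| by direct summation. O(N^4): for tiny illustrative inputs only."""
    h, w = len(gray), len(gray[0])
    mag = [[0.0] * w for _ in range(h)]
    for k in range(h):
        for l in range(w):
            acc = 0j
            for m in range(h):
                for n in range(w):
                    angle = -2.0 * math.pi * (k * m / h + l * n / w)
                    acc += gray[m][n] * cmath.exp(1j * angle)
            mag[k][l] = abs(acc)
    return mag


def fft_feature(gray, eps=1e-8):
    """log(1 + |F|), standardized with per-sample mean and std (assumed meaning of mu_s, sigma_s)."""
    log_mag = [[math.log1p(v) for v in row] for row in dft2_magnitude(gray)]
    flat = [v for row in log_mag for v in row]
    mu = sum(flat) / len(flat)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in flat) / len(flat))
    return [[(v - mu) / (sigma + eps) for v in row] for row in log_mag]
```

On a constant image all energy sits in the DC bin, so the standardized map has its maximum at index (0, 0) and approximately zero mean.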

STFT spectrogram

Over a 1D-flattened frame; preserves where frequency anomalies occur.

\(n_{\text{fft}}=256,\ \text{hop}=128,\ \text{Hann} \to 224\times224\)
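
A rough sketch of the STFT parameters above (\(n_{\text{fft}}=256\), hop 128, Hann window), written with a direct DFT in plain Python. Row-major flattening, the periodic Hann definition, the absence of padding, and the function name are assumptions; the final resize to \(224\times224\) is omitted.

```python
import cmath
import math


def stft_magnitude(frame, n_fft=256, hop=128):
    """Magnitude STFT of a frame flattened to 1D.

    Returns a list of time columns, each holding n_fft // 2 + 1 frequency bins.
    """
    signal = [float(v) for row in frame for v in row]  # row-major flatten (assumption)
    # Periodic Hann window (assumed convention).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]
    columns = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        col = []
        for k in range(n_fft // 2 + 1):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n in range(n_fft))
            col.append(abs(acc))
        columns.append(col)
    return columns
```

Because each column corresponds to a fixed offset in the flattened frame, the column index indicates roughly where in the image a frequency anomaly occurs.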

Landmark heatmap · NOVEL

Gaussian blobs at 36 face-oval keypoints: an explicit face-region prior.

\(H(u,v)=\max_k \exp\!\left(-\dfrac{(u-x_k)^2+(v-y_k)^2}{2\sigma^2}\right)\)
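
The heatmap formula above can be sketched directly: each pixel takes the maximum Gaussian response over all keypoints. The value of \(\sigma\) is not stated on the slide, so the default below is a placeholder assumption, as is the function name.

```python
import math


def landmark_heatmap(keypoints, height, width, sigma=3.0):
    """H(u, v) = max_k exp(-((u - x_k)^2 + (v - y_k)^2) / (2 sigma^2)).

    keypoints: (x_k, y_k) pixel coordinates, e.g. the 36 face-oval points.
    sigma: Gaussian width in pixels (placeholder value, not from the thesis).
    """
    heat = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            best = 0.0
            for xk, yk in keypoints:
                d2 = (u - xk) ** 2 + (v - yk) ** 2
                best = max(best, math.exp(-d2 / (2.0 * sigma ** 2)))
            heat[v][u] = best
    return heat
```

Taking the maximum rather than the sum keeps the map in [0, 1] even where neighbouring blobs overlap along the oval.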

03 · Adaptive gating

A gate that amplifies, never zeroes

Channel gate · applied per modality

\(\hat{\mathbf{F}}=\mathbf{F}\odot\bigl(1+\sigma(\mathbf{W}\,\mathrm{GAP}(\mathbf{F}))\bigr)\)

Standard sigmoid gate

Multiplier in \((0,1)\): the gate can only attenuate, and can drive a modality toward zero before its weights stabilize.

Our \((1+\sigma)\) gate

Multiplier in \((1,2)\): the gate can only amplify, and approaches a neutral 1.0 at its lower bound. No modality is ever zeroed out.

Four independent gates, no parameter sharing: each modality learns its own contribution profile per sample.
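
A plain-Python sketch contrasting the two gates above. It assumes \(\mathbf{W}\) is a single bias-free C×C linear map applied to the pooled channel vector; the actual layer shape is not given on the slide, and `channel_gate` is a hypothetical name.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def global_average_pool(feature):
    """feature: list of C channels, each an H x W nested list. Returns C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def channel_gate(feature, weight, residual=True):
    """Scale each channel by (1 + sigmoid(W GAP(F))) or, if residual=False, by sigmoid alone.

    weight: C x C matrix standing in for W (assumed shape).
    Returns (gated feature, per-channel multipliers).
    """
    pooled = global_average_pool(feature)
    channels = len(feature)
    logits = [sum(weight[i][j] * pooled[j] for j in range(channels))
              for i in range(channels)]
    offset = 1.0 if residual else 0.0
    scale = [offset + sigmoid(z) for z in logits]
    gated = [[[x * scale[c] for x in row] for row in feature[c]]
             for c in range(channels)]
    return gated, scale
```

With strongly negative logits the plain sigmoid multiplier collapses toward 0 and suppresses the modality, while the \((1+\sigma)\) multiplier stays at about 1 and passes the features through unchanged.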

03 · Output heads & training

Dual head, three-term loss

Classification head

\(p(\text{fake})=\sigma\!\bigl(\mathbf{w}^{\top}\mathrm{Dropout}_{0.3}(\mathrm{AvgPool}(\mathbf{F}_{\text{fused}}))\bigr)\)

Video-level aggregation

\(s_{\text{video}}=\dfrac{1}{64}\sum_{i=1}^{64}p_i\)
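
A short sketch of the video-level aggregation above: sample 64 uniformly spaced frames and average their per-frame probabilities. The floor-based index rule and both function names are assumptions; the slide only states that the frames are uniformly spaced.

```python
def uniform_frame_indices(num_frames, num_samples=64):
    """Indices of num_samples uniformly spaced frames (floor rule, assumed)."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    return [(i * num_frames) // num_samples for i in range(num_samples)]


def video_score(frame_probs, num_samples=64):
    """Mean of p(fake) over the sampled frames.

    frame_probs: per-frame probabilities for every frame of the clip.
    Clips shorter than num_samples reuse frames.
    """
    indices = uniform_frame_indices(len(frame_probs), num_samples)
    return sum(frame_probs[i] for i in indices) / num_samples
```

A clip whose first half scores 0 and second half scores 1 receives a video score of 0.5, which shows how mean aggregation discards the order of the frames.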

U-Net decoder reuses ResNet skips \(x_0\dots x_3\) for a \(224\times224\) mask.
AdamW, lr \(10^{-4}\), warmup + cosine, 15 epochs, batch 32.

Total loss

\[\mathcal{L}=\mathcal{L}_{\text{cls}}+0.3\,\mathcal{L}_{\text{weak}}+0.5\,\mathcal{L}_{\text{sup}}\]

1.0

Classification

BCE with label smoothing 0.1. Carries the primary training signal.

0.3

Weak supervision

On the mean predicted mask. Low-information, so downweighted.

0.5

Strong supervision

Pixel-level BCE on approximate pseudo-masks, downweighted to protect accuracy.
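
A per-sample sketch of the three-term loss with the weights 1.0 / 0.3 / 0.5 from the slide. Two details are assumptions: label smoothing is taken as \(y(1-s)+0.5s\), and the weak term is taken as BCE between the mean predicted mask value and the clip label. The function names are hypothetical.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction p against a (possibly soft) target y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def smooth_label(y, smoothing=0.1):
    """Label smoothing toward 0.5 (assumed convention): 1 -> 0.95, 0 -> 0.05."""
    return y * (1.0 - smoothing) + 0.5 * smoothing


def total_loss(p_fake, label, pred_mask, pseudo_mask,
               w_weak=0.3, w_sup=0.5, smoothing=0.1):
    """L = L_cls + 0.3 * L_weak + 0.5 * L_sup for a single frame."""
    pred = [v for row in pred_mask for v in row]
    target = [v for row in pseudo_mask for v in row]
    l_cls = bce(p_fake, smooth_label(label, smoothing))
    # Weak term: the mean predicted mask should agree with the clip label (assumption).
    l_weak = bce(sum(pred) / len(pred), label)
    # Strong term: pixel-level BCE against the approximate pseudo-mask.
    l_sup = sum(bce(p, t) for p, t in zip(pred, target)) / len(pred)
    return l_cls + w_weak * l_weak + w_sup * l_sup, (l_cls, l_weak, l_sup)
```

The function returns the individual terms alongside the total so the relative scale of the three signals can be inspected during training.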

04 · Experimental setup

Three datasets, three domains

FaceForensics++

Broadcast. 127K frames, 4 first/second-gen methods.

Celeb-DF

Social media. 179K frames, high-quality second-gen swap.

VCDF-X

Conferencing. 13,768 clips, 9 third-gen methods.

Metrics

AUC (primary) · macro-F1 · EER · IoU (localization)

Baselines (same recipe)

ResNet-50 · MViT-V2-S · ViT-Base · Swin-V2-B · MDF (FAU + rPPG)

Identical recipe across conditions (optimizer, lr, schedule, seed 42). The only variable is the one each study isolates.
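
For reference, a plain-Python sketch of three of the listed metrics (AUC, EER, IoU); macro-F1 is omitted for brevity. These are textbook definitions written for small inputs, not the evaluation code used in the thesis, and the function names and the 0.5 mask threshold are assumptions.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a fake (1) outscores a real (0); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def equal_error_rate(scores, labels):
    """Error rate at the threshold where false-positive and false-negative rates are closest."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, best_eer = None, None
    for t in sorted(set(scores)):
        fnr = sum(p < t for p in pos) / len(pos)
        fpr = sum(n >= t for n in neg) / len(neg)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2.0
    return best_eer


def mask_iou(pred, target, threshold=0.5):
    """Intersection over union of two masks after thresholding."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            a, b = p >= threshold, t >= threshold
            inter += a and b
            union += a or b
    return inter / union if union else 1.0
```

An AUC below 0.5 means the detector ranks real clips above fakes more often than not, which is how a cross-domain score can fall below chance.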

04 · Ablation 1 of 3 · backbone

ImageNet pretraining is the
single largest gain

Version	Key change	AUC	IoU
v1	Random initialization	0.939	0.672
v2	Balanced sampler + method head	0.913	0.643
v3	LSTM over 8 frames	0.865	0.513
v4	1-frame train, 64-frame infer	0.942	0.672
v5 · ResNet18	ImageNet pretrain	0.986	0.759
v5 · EffNet	EfficientNet-B0 pretrained	0.979	0.756
v6 · ViT-B/16	Vision Transformer	0.973	0.717
SWIN-T	SWIN + multi-scale masks	0.946	0.175

+0.047

AUC from random init → ImageNet pretrain (0.939 → 0.986)

CNNs beat transformers at this data scale.
LSTM hurts: 8 frames too short to learn dynamics.

3-modality architecture (RGB + FFT + STFT), single-dataset training on VCDF-X.

04 · Ablation 2 of 3 · modality

The landmark heatmap drives
the largest single gain

Modalities	VCDF-X	FF++	CelebDF	IoU
RGB only	0.962	0.917	0.984	0.701
+ FFT	0.974	0.926	0.989	0.722
+ STFT	0.972	0.937	0.993	0.731
+ Landmark (full)	0.999	0.933	0.993	0.956

+0.225

localization IoU from adding the landmark heatmap (0.731 → 0.956)

+0.027

VCDF-X AUC, concentrated on the conferencing domain

The explicit face prior decouples the face signal from background appearance, exactly where conferencing video differs most.

04 · Ablation 3 of 3 · the central finding

Standard benchmarks alone score
below chance on conferencing

Training data	VCDF-X	FF++	CelebDF
FF++ + CelebDF only	0.454	0.948	0.995
VCDF-X only	0.999	0.510	0.570
All three combined	0.999	0.933	0.993

Full 4-modal architecture. The only variable is the training-set composition.

0.454

AUC · FF++ & CelebDF → VCDF-X

Worse than a random classifier on conferencing video

Any conferencing detector must be trained on conferencing-domain data. Combined training keeps VCDF-X at 0.999 and stays within 0.015 AUC of the best single-source result on FF++ (0.933 vs 0.948) and within 0.002 on CelebDF (0.993 vs 0.995).

04 · Comparison with baselines

Matches or beats MDF on simpler inputs

Model	Train	Test	AUC	F1
In-domain (train & test on VCDF-X)
ResNet-50	VCDF-X	VCDF-X	0.974	0.975
Swin-V2-Base	VCDF-X	VCDF-X	0.997	0.975
MDF (FAU + rPPG)	VCDF-X	VCDF-X	0.980	0.907
MMGUNet (3-modal)	VCDF-X	VCDF-X	0.983	0.950
Combined training → per-domain test
MDF	All three	FF++	0.935	0.808
MDF	All three	CelebDF	0.998	0.981
MDF	All three	VCDF-X	0.975	0.907
MMGUNet (4-modal)	All three	FF++	0.933	0.944
MMGUNet (4-modal)	All three	CelebDF	0.993	0.988
MMGUNet (4-modal)	All three	VCDF-X	0.999	0.993

05 · Discussion

Three findings, three mechanisms

Pretraining dominates

ImageNet features regularize the encoder against dataset-specific noise. Larger AUC gain (+0.047) than any architectural change in the backbone ablation.

Geometry prior helps most

The landmark heatmap decouples the face signal from background, exactly where conferencing differs.

Combined training is required

No implicit transfer from broadcast GAN content to conferencing diffusion content.

For the field: single-benchmark AUC above 0.99 is an in-distribution upper bound, not a deployment estimate. Cross-domain evaluation should be a default reporting requirement.

06 · Conclusion

Contributions & what comes next

VCDF-X

Joint

First conferencing-domain benchmark built on modern generation methods, with 8 annotation modalities. Submitted to ACM MM 2026.

MMGUNet

Mine

AUC 0.999 / 0.933 / 0.993 (VCDF-X / FF++ / CelebDF) under combined training. Matches or beats MDF on simpler inputs.

Generalization

Mine

First quantitative measure: training on standard benchmarks alone scores below chance on conferencing video (AUC 0.454).

Future work

Integrate rPPG + FAU modalities
Attention-based temporal modeling
Real-time inference (quantization, distillation)
LLM-grounded explanations
Adversarial robustness

Backup · per-method analysis

Mean score by source label

Source label	Mean score	N
Real	0.288	164
LTX-2 (AIGC)	0.786	157
Ovi (AIGC)	~	<30*
dlc (Inswapper+GFPGAN)	0.598	44
LiveAvatar (14B)	0.568	92
FaceFusion / VisoMaster / DreamID-V	~	<30*

Real well-separated from every fake label.
AIGC easiest: no natural acquisition noise to mimic.
Hardest: LiveAvatar (no blend boundary) and dlc (GFPGAN smooths the FFT signature).
* <30 videos: reported for completeness, not statistically interpretable.

Mean video-level score on a 500-clip subsample. Mean scores are not directly comparable to dataset-level AUC.
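
A sketch of how a per-label table like the one above could be produced: group video-level scores by source label and suppress the mean for groups under 30 videos. The record layout and the function name are hypothetical; only the 30-video cutoff comes from the slide.

```python
from collections import defaultdict


def mean_score_by_label(records, min_count=30):
    """records: iterable of (source_label, video_score) pairs.

    Returns {label: (mean or None, n)}; the mean is withheld when n < min_count.
    """
    groups = defaultdict(list)
    for label, score in records:
        groups[label].append(score)
    table = {}
    for label, scores in groups.items():
        n = len(scores)
        mean = sum(scores) / n if n >= min_count else None
        table[label] = (mean, n)
    return table
```

Withholding the mean for small groups mirrors the table's footnote: fewer than 30 videos is too few to interpret statistically.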

Backup · failure modes

Where the detector errs

Unusual lighting → fake

Low-light and backlit scenes: natural shadow boundaries read as blending artifacts.

Short LiveAvatar → real

Clips under 12 s lack the temporal signature the model relies on; autoregressive output is incoherent early.

Heavy makeup → fake

Extreme appearance modifications and occlusion were unseen in training and look like artifacts.

These motivate the temporal-modeling and adversarial-robustness directions in future work.

Backup · threats to validity

Split granularity & in-domain leakage

The caveat

Clip-level random split does not guarantee source disjointness: long recordings were segmented into sibling clips, and swaps share a synthetic identity pool. In-domain AUC ≈ 0.999 is an optimistic upper bound.

Why the finding holds

The generalization claims are cross-dataset. Transfer failures are measured on FF++ and CelebDF, which share no source, identities, or pipelines with VCDF-X. Those numbers are independent of the VCDF-X split.

A source-level / identity-level partition of VCDF-X would tighten the in-domain figures. Left for future work.
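
One way such a source-level partition could be built is sketched below: every clip cut from the same recording lands in the same split, decided by a hash of the source identifier. The `source_id` field and the function name are hypothetical; VCDF-X metadata may be organized differently, and an identity-level split would hash an identity key instead.

```python
import hashlib


def split_by_source(clips, test_fraction=0.2):
    """Deterministic train/test split in which sibling clips never straddle the boundary.

    clips: list of dicts, each with a 'source_id' string (hypothetical field).
    """
    train, test = [], []
    for clip in clips:
        digest = hashlib.sha256(clip["source_id"].encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
        (test if bucket < test_fraction else train).append(clip)
    return train, test
```

Hashing makes the assignment reproducible without storing a split file, at the cost of the test fraction being only approximately met.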

Backup · limitations

Stated plainly

Simulated transmission: offline segmentation and post-hoc compression, not real Zoom/Teams pipelines.
Demographic imbalance: reflects source YouTube distribution; some subgroups underrepresented.
STFT semantics: non-standard on a flattened image; a 2D wavelet basis would be more principled.

No temporal modeling: mean aggregation discards temporal statistics; the LSTM variant (v3) lowered AUC to 0.865.
Single-GPU scale: results are a single-GPU baseline, not a scale-frontier result.
Real-time gap: ~60-80 FPS on the training GPU; deployment needs quantization or a smaller backbone.

Backup · reproducibility

Compute & protocol

6×

RTX A6000, 48GB, Slurm (ITMO AI Talent Hub)

~11h

per configuration, single GPU, batch 32

64

uniformly-spaced frames at inference

15

epochs, AdamW, warmup + cosine

Dataset access is restricted to reviewers under the owner's policy. MMGUNet code and weights will be released with the camera-ready paper.

Questions are welcome
