The Rise of Multimodal Localization: Navigating Video and Audio Content in a Global Era
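From November 8 to 10, the International Forum on Innovation and Development of Language Services (Xiamen), held together with the 2024 Annual Meeting of the Translation Service Committee of the Translators Association of China, took place in Xiamen. After attending, Josef Kubovsky, CEO of Nimdzi Insights, published three long-form articles setting out his thinking on the language services industry. This is the second of the three, and it focuses on multimodal localization of video and audio content.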
(Image source: Visual China Group)
The Rise of Multimodal Localization:
Navigating Video and Audio Content in a Global Era
多模态本地化的崛起:
在全球化时代驾驭视频与音频内容
Josef Kubovsky
November 29, 2024
Just two days ago, I shared insights from my time at TAC LSC 2024 in Xiamen, focusing on generative AI’s role in localization and the importance of cultural sensitivity. Today, I want to expand on another key theme from the conference: the rise of multimodal localization.
两天前,我分享了我在厦门参加TAC LSC 2024的见闻,重点探讨了生成式人工智能在本地化中的作用以及文化敏感性的必要性。今天,我想进一步探讨会议中的另一个重要主题:多模态本地化的崛起。
The rise of short-form video platforms like Douyin and TikTok has revolutionized how people consume content globally. As videos become the primary mode of communication, businesses are racing to make their multimedia accessible to international audiences. But video and audio localization present unique challenges that go beyond traditional text translation.
短视频平台如抖音和TikTok的兴起已经彻底改变了世界各地人们的内容消费方式。随着视频成为主要的沟通方式,各大企业正竞相向国际受众提供多媒体内容。然而,视频和音频的本地化呈现出不同于传统文本翻译的独特挑战。
At TAC LSC 2024, several panels, including “Future-Proofing Multimedia Localization”, tackled this topic. It became clear that multimodal localization involves far more than simply translating subtitles. It includes:
在TAC LSC 2024大会上,包括“未来多媒体本地化”在内的多个专题都深入探讨了这一议题。很明显,多模态本地化不仅仅是翻译字幕,还涉及:
Dubbing: Synchronizing translated dialogue with lip movements and emotions.
配音:将翻译的对话与角色的口型和情感同步。
Subtitling: Condensing spoken language into readable, culturally relevant text.
字幕:将口语内容简化为可读且符合文化背景的文本。
Voiceovers: Creating narration that fits both tone and context.
旁白:创造既符合语气又契合语境的叙述。
Cultural Adaptation: Modifying visual and auditory cues to align with target audience expectations.
文化适配:调整视觉和听觉元素以符合目标受众的期望。
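One way to make the "readable" requirement for subtitles concrete is a reading-speed check. The sketch below is only an illustration: the `Cue` structure, the function names, and the limit of 17 characters per second are assumptions chosen for the example, not figures from the conference, and real style guides set different limits per language and audience.

```python
from dataclasses import dataclass

# Assumed reading-speed limit for this sketch; style guides differ.
MAX_CHARS_PER_SECOND = 17.0


@dataclass
class Cue:
    """A single subtitle cue (hypothetical structure for illustration)."""
    start: float  # display start time, in seconds
    end: float    # display end time, in seconds
    text: str     # subtitle text, possibly with a line break


def chars_per_second(cue: Cue) -> float:
    """Return the reading speed a viewer needs to finish this cue."""
    duration = cue.end - cue.start
    if duration <= 0:
        raise ValueError("cue must have a positive duration")
    # Line breaks are layout, not reading load, so they are not counted.
    return len(cue.text.replace("\n", "")) / duration


def too_dense(cues, limit=MAX_CHARS_PER_SECOND):
    """Return the cues whose text should be condensed or given more time."""
    return [cue for cue in cues if chars_per_second(cue) > limit]
```

Cues returned by `too_dense` are candidates for a human subtitler to shorten, which is where the cultural judgment described above comes in; the check itself only measures length against time.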
A memorable case study shared at the event illustrated these challenges. A Chinese drama being localized for Western audiences struggled with adapting visual and cultural elements. While the AI-powered transcription and subtitle generation were seamless, emotional nuances in the voiceover failed to resonate with Western viewers until human editors stepped in to refine them.
一个令人印象深刻的案例研究揭示了这些挑战。一部为西方观众本地化的中国电视剧在调整视觉和文化元素时遇到困难。尽管AI驱动的转录和字幕生成无缝完成,但配音的情感细腻度未能打动西方观众,直到人类编辑介入优化。
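▲ Qi Huilin (齐慧琳), Key Account Manager at 上海唐能翻译咨询有限公司 (TalkingChina), presenting "Subtitle Translation Services for Film and TV Drama Exports in Practice"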
Advancements in generative AI have significantly improved the scalability of video and audio localization. Tools powered by large language models (LLMs) can now automate:
生成式人工智能的进步显著提升了视频和音频本地化的可扩展性。基于大型语言模型(LLM)的工具如今可以自动完成:
1. Transcription 转录:
Generating accurate scripts from audio, even for complex dialogue.
从音频中生成准确的脚本,即使是复杂对话也不在话下。
2. Translation 翻译:
Converting scripts into multiple languages in a fraction of the time.
在短时间内将脚本转化为多种语言。
3. Synthetic Dubbing 合成配音:
Using AI-generated voices to match the tone and pace of original speakers.
利用 AI 生成与原始说话者语气、节奏相匹配的声音。
For example, at TAC LSC 2024, a demonstration by a Chinese tech firm highlighted how AI-enabled dubbing reduced the localization timeline for a 10-episode series by nearly 50%. However, as several speakers noted, these tools often stumble when handling cultural context or delivering emotional fidelity.
例如,在TAC LSC 2024会议上,一家中国科技公司展示了通过AI配音将一部10集电视剧的本地化时间缩短近 50%。然而,正如多位发言者指出的那样,这些工具在处理文化语境或传递真情实感时仍然存在不足。
Despite the efficiency of AI, human expertise remains indispensable in multimodal localization, particularly in high-stakes projects.
尽管人工智能效率高,但在多模态本地化中,尤其是高风险项目中,人类专家的作用仍不可替代。
● Cultural Nuances in Voiceovers
语音配音中的文化细微差异
During one panel, speakers discussed how literal translations of Chinese idioms often lead to awkward voiceovers in English. For example, the phrase “借花献佛” (“offer flowers to the Buddha,” meaning to give gifts using someone else’s resources) was rendered mechanically as “borrowing flowers for Buddha,” losing its metaphorical meaning entirely.
在一个专题讨论中,专家提到,将中文成语直译成英文常导致生硬的配音效果。例如,“借花献佛”(指用别人的资源送礼)被机械地翻译为“borrowing flowers for Buddha”(给佛祖献上借来的花),完全丧失了其隐喻意义。
Human voice artists and editors ensure that such expressions are adapted into culturally appropriate equivalents, preserving the impact of the content.
配音演员和人工编辑能将这些表达调整为文化适配的等效语,从而保留内容的情感冲击力。
● Emotional Resonance in Dubbing
配音中的情感共鸣
Synthetic voices, while advanced, still lack the emotional depth required for certain genres, such as dramas or documentaries. One presenter shared an instance where a synthetic voice in a translated documentary about rural Chinese traditions failed to convey the reverence intended for the subject matter. Human intervention corrected the tone, ensuring it aligned with audience expectations.
尽管合成声音技术已十分先进,但在处理特定类型(如戏剧或纪录片)时,仍然缺乏传递深层情感的能力。演讲者分享了一个例子:在一部关于中国农村传统的纪录片中,合成配音未能传达出对该主题应有的敬意。最终由人工介入调整语气,确保符合观众的期望。
Just as in text translation, hybrid workflows are proving to be the key to success in multimodal localization. These workflows combine AI’s speed and scalability with the creativity and cultural awareness of human professionals.
正如在文本翻译中一样,混合工作流在多模态本地化中也证明是成功的关键。这种工作流结合了 AI 的速度与可扩展性以及人类专家的创造力与文化意识。
How Hybrid Multimedia Localization Works
混合多媒体本地化的运作方式
1. AI Pre-Processing
AI 预处理
Machines generate transcriptions, subtitles, and initial dubbing tracks.
机器生成转录、字幕和初步配音。
2. Human Refinement
人工精修
Linguists and cultural consultants review AI outputs for accuracy and emotional depth.
语言学家和文化顾问审查 AI 输出结果,确保准确性和情感深度。
3. Final Quality Control
最终质量控制
Specialized teams ensure synchronization, linguistic fidelity, and cultural alignment.
专业团队确保同步性、语言一致性和文化适配性。
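The three stages above can be sketched as a small routing routine. This is a minimal illustration under stated assumptions, not the workflow any speaker described in detail: the `Segment` fields, the confidence threshold of 0.8, the idiom list, and the `human_edit` callback are all hypothetical, and the AI pre-processing stage is represented only by the draft translation and score already stored on each segment.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Segment:
    """One dialogue line moving through the pipeline (hypothetical structure)."""
    source_text: str
    ai_translation: str           # stage 1 output: the AI draft
    ai_confidence: float          # assumed score in [0, 1] from the AI stage
    flags: List[str] = field(default_factory=list)
    final_text: Optional[str] = None


# Illustrative values only; a real project would tune or replace both.
IDIOMS_NEEDING_ADAPTATION = {"借花献佛"}
CONFIDENCE_THRESHOLD = 0.8


def needs_human_refinement(seg: Segment) -> bool:
    """Stage 2 routing: record why a linguist should review this segment."""
    seg.flags.clear()
    if seg.ai_confidence < CONFIDENCE_THRESHOLD:
        seg.flags.append("low_confidence")
    if any(idiom in seg.source_text for idiom in IDIOMS_NEEDING_ADAPTATION):
        seg.flags.append("idiom")
    return bool(seg.flags)


def run_hybrid_pass(segments: List[Segment],
                    human_edit: Callable[[Segment], str]) -> List[Segment]:
    """Send flagged segments to a human editor; accept the AI draft otherwise."""
    for seg in segments:
        if needs_human_refinement(seg):
            seg.final_text = human_edit(seg)
        else:
            seg.final_text = seg.ai_translation
    return segments


def final_qc(segments: List[Segment]) -> List[Segment]:
    """Stage 3: return segments that still have no approved text."""
    return [seg for seg in segments if not seg.final_text]
```

In this sketch a line containing "借花献佛" is routed to a human even when the AI reports high confidence, which mirrors the point that literal idiom translations can look fluent while losing their meaning. Synchronization and dubbing checks from the final stage are outside what this example covers.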
A standout example from TAC LSC 2024 was a Chinese gaming company localizing an interactive fantasy game for European markets. AI generated subtitles and automated dubbing for over 30,000 in-game dialogue lines. Human editors then tailored the content, replacing culturally specific Chinese folklore references with European mythology. The hybrid approach reduced project costs by 40% while delivering a localized game that felt native to its target audience.
一个来自TAC LSC 2024的亮点案例是,一家中国游戏公司为欧洲市场本地化一款互动奇幻游戏。AI生成了字幕并为超过30,000条游戏对话提供了自动配音。随后,人类编辑根据文化需求调整内容,将特定的中国民俗替换为欧洲神话。这种混合方法将项目成本降低了 40%,同时交付了一款贴近目标市场文化的本地化游戏。
As video and audio content continue to dominate global communication, multimodal localization is no longer optional – it’s essential. The challenge lies in balancing the efficiency of AI with the emotional and cultural depth that only humans can provide.
随着视频和音频内容持续主导全球交流,多模态本地化已不再是可选项,而成为了必选项。挑战在于如何在AI的效率与只有人类才能提供的情感和文化深度之间找到平衡。
Source | TAC-LSC 2024 Summit
Production | 绢生
Review | 肖英 / 万顷
Final review | 清欢