Wolfgang Macherey
Wolfgang Macherey joined Google in 2006 as a research scientist, where he works in the machine translation group with Franz Josef Och. He has been working on natural language processing since 1996.
Wolfgang worked as a Research Assistant at RWTH Aachen University from 1999 to 2005. His main research interests are statistical machine translation and automatic speech recognition, with a focus on discriminative training methods, natural language processing, statistical pattern recognition, and machine learning.
He received a PhD in Computer Science from RWTH Aachen University, Germany, in 2010 and his Diploma Degree in Computer Science from RWTH Aachen University in 1999 with a major in statistical pattern recognition and a minor in physical chemistry and thermodynamics.
Authored Publications
We present Mu2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu2SLAM trains on a sequence-to-sequence masked denoising objective similar to T5 on both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu2SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 0.9 BLEU points. On Voxpopuli ASR, our model matches the performance of an mSLAM model finetuned with an RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
Lijun Yu
Zhiruo Wang
Yonatan Bisk
Alex Hauptmann
Lu Jiang
NeurIPS (2023)
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
Multilingual neural machine translation (NMT) typically learns to maximize the likelihood of training examples from a combination set of multiple language pairs. However, this mechanical combination only relies on the basic sharing to learn the inductive bias, which undermines the generalization and transferability of multilingual NMT models. In this paper, we introduce a multilingual crossover encoder-decoder (mXEnDec) to fuse language pairs at instance level to exploit cross-lingual signals. For better fusions on multilingual data, we propose several techniques to deal with the language interpolation, dissimilar language fusion and heavy data imbalance. Experimental results on a large-scale WMT multilingual data set show that our approach significantly improves model performance on general multilingual test sets and the model transferability on zero-shot test sets (up to +5.53 BLEU). Results on noisy inputs demonstrate the capability of our approach to improve model robustness against the code-switching noise. We also conduct qualitative and quantitative representation comparisons to analyze the advantages of our approach at the representation level.
Building Machine Translation Systems for the Next Thousand Languages
Julia Kreutzer
Mengmeng Niu
Pallavi Nikhil Baljekar
Xavier Garcia
Maxim Krikun
Pidong Wang
Apu Shah
Macduff Richard Hughes
Google Research (2022)
Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
George Foster
David Grangier
Viresh Ratnakar
Qijun Tan
Transactions of the Association for Computational Linguistics, vol. 9, pp. 1460-1474
Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
Recently, self-supervised pre-training of text representations has been successfully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve dramatic success on resource-rich NMT. In this paper, we propose a joint training approach, F2-XEnDec, to train NMT models jointly with self-supervised and supervised learning. To this end, a new task called crossover encoder-decoder (XEnDec) is designed to entangle their representations. The key idea is to combine pseudo parallel sentences (also generated by XEnDec) used in self-supervised training and parallel sentences in supervised training through a second crossover. Experiments on two resource-rich translation benchmarks, WMT'14 English-German and English-French, demonstrate that our approach achieves substantial improvements over the Transformer. We also show that our approach is capable of improving the model robustness against input perturbations, in particular for code-switched perturbations.
Re-translation versus Streaming for Simultaneous Translation
Naveen Ari
George Foster
IWSLT 2020, Association for Computational Linguistics
There has been great progress in improving streaming machine translation, a simultaneous paradigm where the system appends to a growing hypothesis as more source content becomes available. We study a related problem in which revisions to the hypothesis beyond strictly appending words are permitted. This is suitable for applications such as live captioning an audio feed. In this setting, we compare custom streaming approaches to re-translation, a straightforward strategy where each new source token triggers a distinct translation from scratch. We find re-translation to be as good as or better than state-of-the-art streaming systems, even when operating under constraints that allow very few revisions. We attribute much of this success to a previously proposed data-augmentation technique that adds prefix-pairs to the training data, which alongside wait-k inference forms a strong baseline for streaming translation. We also highlight re-translation's ability to wrap arbitrarily powerful MT systems with an experiment showing large improvements from an upgrade to its base model.
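As a rough illustration of the re-translation strategy described in this abstract, the sketch below translates every growing source prefix from scratch and counts how many previously shown tokens each update would have to erase. The function names, the erasure count, and the toy translator (Python's `sorted`, standing in for a reordering MT system) are assumptions made for this example and are not taken from the paper.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def retranslate(source_tokens, translate):
    """Translate each growing source prefix from scratch.

    `translate` is a hypothetical callable mapping a list of source
    tokens to a list of target tokens.
    """
    return [translate(source_tokens[:i]) for i in range(1, len(source_tokens) + 1)]


def total_erasure(outputs):
    """Count tokens removed from the displayed output across all updates."""
    erased = 0
    previous = []
    for current in outputs:
        erased += len(previous) - common_prefix_len(previous, current)
        previous = current
    return erased


# Toy stand-in for an MT system: sorting reorders tokens, which forces a revision.
outputs = retranslate(["b", "a", "c"], sorted)
erasure = total_erasure(outputs)
```

Here the second update replaces the displayed token "b" with "a b", so one token is erased in total. A real system would replace `sorted` with calls to an actual translation model.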
We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repeatedly translated from scratch as it grows. This approach naturally exhibits very low latency and high final quality, but at the cost of incremental instability as the output is continuously refined. We experiment with a pipeline of industry-grade speech recognition and translation tools, augmented with simple inference heuristics to improve stability. We use TED Talks as a source of multilingual test data, developing our techniques on English-to-German spoken language translation. Our minimalist approach to simultaneous translation allows us to easily scale our final evaluation to six more target languages, dramatically improving incremental stability for all of them.
In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT). The main idea is to minimize the vicinal risk over virtual sentences sampled from two vicinity distributions, in which the crucial one is a novel vicinity distribution for adversarial sentences that describes a smooth interpolated embedding space centered around observed training sentence pairs. We then discuss our approach, AdvAug, to train NMT models using the embeddings of virtual sentences in sequence-to-sequence learning. Experiments on Chinese-English, English-French, and English-German translation benchmarks show that AdvAug achieves significant improvements over the Transformer (up to 4.9 BLEU points), and substantially outperforms other data augmentation techniques (e.g. back-translation) without using extra corpora.
KoBE: Knowledge-Based Machine Translation Evaluation
Findings of EMNLP (2020)
We propose a simple and effective method for machine translation evaluation which does not require reference translations. Our approach is based on (1) grounding the entity mentions found in each source sentence and candidate translation against a large-scale multilingual knowledge base, and (2) measuring the recall of the grounded entities found in the candidate vs. those found in the source. Our approach achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references, which is the largest number of wins for a single evaluation method on this task. On 4 language pairs, we also achieve higher correlation with human judgements than BLEU. To foster further research, we release a dataset containing 1.8 million grounded entity mentions across 18 language pairs from the WMT19 metrics track data.
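The recall step in part (2) of this abstract can be sketched in a few lines. This is a minimal illustration only: it assumes entity mentions have already been grounded to knowledge-base identifiers (the IDs below are placeholders), it treats the mentions as multisets, and it returns 1.0 when the source has no entities. All three choices are assumptions of this example, not details confirmed by the abstract.

```python
from collections import Counter


def entity_recall(source_entities, candidate_entities):
    """Fraction of grounded source entities that also appear in the candidate.

    Both arguments are lists of knowledge-base identifiers (hypothetical
    strings here). Repeated mentions are counted with multiplicity.
    """
    source = Counter(source_entities)
    if not source:
        return 1.0  # assumption: nothing to recall counts as perfect recall
    candidate = Counter(candidate_entities)
    matched = sum((source & candidate).values())  # multiset intersection
    return matched / sum(source.values())


# Source mentions three entities (one twice); the candidate keeps two of them.
score = entity_recall(["E1", "E2", "E2"], ["E2", "E1"])
```

Because no reference translation enters the computation, a score like this depends only on the source sentence and the candidate, which matches the reference-free setting the abstract describes.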
Monotonic Infinite Lookback Attention for Simultaneous Machine Translation
Naveen Ari
Chung-Cheng Chiu
Semih Yavuz
Ruoming Pang
Wei Li
Colin Raffel
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, Florence, Italy (2019), pp. 1313-1323
Simultaneous machine translation begins to translate each source sentence before the source speaker is finished speaking, with applications to live and streaming scenarios. Simultaneous systems must carefully schedule their reading of the source sentence to balance quality against latency. We present the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far. We do so by introducing Monotonic Infinite Lookback (MILk) attention, which maintains both a hard, monotonic attention head to schedule the reading of the source sentence, and a soft attention head that extends from the monotonic head back to the beginning of the source. We show that MILk's adaptive schedule allows it to arrive at latency-quality trade-offs that are favorable to those of a recently proposed wait-k strategy for many latency values.
Neural machine translation (NMT) is vulnerable to noisy perturbations in the input, which can cause a model trained on the clean data to behave abnormally on the noisy input. We propose an approach to improving the robustness of NMT models, which consists of two parts: (1) attack the translation model with adversarial source examples; (2) defend the translation model with adversarial target input to be robust against adversarial source input. For the generation of adversarial input, we propose to use a gradient-based method to craft adversarial examples that are advised by the translation loss in NMT based on the clean input. Experimental results on Chinese-English and English-German translation tasks demonstrate that our approach achieves significant improvements on the standard clean data and exhibits robustness on the noisy data.
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Dmitry (Dima) Lepikhin
George Foster
Maxim Krikun
Naveen Ari
(2019)
We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages trained over 25 billion examples. Our system demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines. We provide in-depth analysis of various aspects of model building that are crucial to the quality and practicality towards universal NMT. While we prototype a high-quality universal translation system, our extensive empirical analysis exposes issues that need to be further addressed, and we suggest directions for future research.
Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation
Ye Jia
Chung-Cheng Chiu
Naveen Ari
Stella Marie Laurenzo
ICASSP (2019)
End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora of speech and translated transcript pairs. Previous studies have proposed the use of pre-trained components and multi-task learning in order to benefit from weakly supervised training data, such as speech-to-transcript or text-to-foreign-text pairs. In this paper, we demonstrate that using pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly supervised data into speech-to-translation pairs for ST training can be more effective than multi-task learning. Furthermore, we demonstrate that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance. Finally, we discuss methods for avoiding overfitting to synthetic speech with a quantitative ablation study.
Direct speech-to-speech translation with a sequence-to-sequence model
Ye Jia
Interspeech (2019)
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.
Revisiting Character-Based Neural Machine Translation with Capacity and Compression
George Foster
Empirical Methods in Natural Language Processing (2018)
Translating characters instead of words or word-fragments has the potential to simplify the processing pipeline for neural machine translation (NMT), and improve results by eliminating hyper-parameters and manual feature engineering. However, it results in longer sequences in which each symbol contains less information, creating both modeling and computational challenges. In this paper, we show that the modeling problem can be solved by standard sequence-to-sequence architectures of sufficient depth, and that deep models operating at the character level outperform identical models operating over word fragments. This result implies that alternative architectures for handling character input are better viewed as methods for reducing computation time than as improved ways of modeling longer sequences. From this perspective, we evaluate several techniques for character-level NMT, verify that they do not match the performance of our deep character baseline model, and evaluate the performance versus computation time tradeoffs they offer. Within this framework, we also perform the first evaluation for NMT of conditional computation over time, in which the model learns which timesteps can be skipped, rather than having them be dictated by a fixed schedule specified before training begins.
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation
George Foster
Llion Jones
Macduff Hughes
Mike Schuster
Niki J. Parmar
ACL'18 (2018) (to appear)
The past year has witnessed rapid advances in sequence-to-sequence (seq2seq) modeling for Machine Translation (MT). The classic RNN-based approaches to MT were first out-performed by the convolutional seq2seq model, which was then out-performed by the more recent Transformer model. Each of these new approaches consists of a fundamental architecture accompanied by a set of modeling and training techniques that are in principle applicable to other seq2seq architectures. In this paper, we tease apart the new architectures and their accompanying techniques in two ways. First, we identify several key modeling and training techniques, and apply them to the RNN architecture, yielding a new RNMT+ model that outperforms all of the three fundamental architectures on the benchmark WMT'14 English to French and English to German tasks. Second, we analyze the properties of each fundamental seq2seq architecture and devise new hybrid architectures intended to combine their strengths. Our hybrid models obtain further improvements, outperforming the RNMT+ model on both benchmark datasets.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Mike Schuster
Mohammad Norouzi
Maxim Krikun
Qin Gao
Apurva Shah
Xiaobing Liu
Łukasz Kaiser
Stephan Gouws
Taku Kudo
Keith Stevens
George Kurian
Nishant Patil
Wei Wang
Jason Smith
Alex Rudnick
Macduff Hughes
CoRR, vol. abs/1609.08144 (2016)
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
Chris Dyer
Franz Och
Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, ACL and AFNLP (2009), pp. 163-171
Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation
Franz Och
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 620-629
We present Minimum Bayes-Risk (MBR) decoding over translation lattices that compactly encode a huge number of translation hypotheses. We describe conditions on the loss function that will enable efficient implementation of MBR decoders on lattices. We introduce an approximation to the BLEU score (Papineni et al., 2001) that satisfies these conditions. The MBR decoding under this approximate BLEU is realized using Weighted Finite State Automata. Our experiments show that the Lattice MBR decoder yields moderate, consistent gains in translation performance over N-best MBR decoding on Arabic-to-English, Chinese-to-English and English-to-Chinese translation tasks. We conduct a range of experiments to understand why Lattice MBR improves upon N-best MBR and also study the impact of various parameters on MBR performance.
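For readers unfamiliar with the N-best MBR baseline mentioned in this abstract, the sketch below picks the hypothesis with the highest expected gain under the model's posterior over an N-best list. It does not implement the lattice algorithm or the paper's BLEU approximation: the gain function is a toy unigram overlap chosen for this example, and the hypotheses and scores are made up.

```python
import math
from collections import Counter


def unigram_gain(hyp, ref):
    """Toy similarity in [0, 1]: clipped unigram overlap over the longer length."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    return overlap / max(len(hyp), len(ref), 1)


def mbr_decode(nbest, gain=unigram_gain):
    """Return the hypothesis with the highest expected gain.

    `nbest` is a list of (tokens, log_score) pairs. Scores are
    renormalized over the list to form a posterior distribution.
    """
    top = max(score for _, score in nbest)
    weights = [math.exp(score - top) for _, score in nbest]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    best_hyp, best_gain = None, float("-inf")
    for hyp, _ in nbest:
        expected = sum(p * gain(hyp, other)
                       for (other, _), p in zip(nbest, posteriors))
        if expected > best_gain:
            best_hyp, best_gain = hyp, expected
    return best_hyp


# Made-up 3-best list: the top-scoring hypothesis is an outlier.
nbest = [
    (["x", "y"], math.log(0.40)),
    (["a", "b", "c"], math.log(0.35)),
    (["a", "b", "d"], math.log(0.25)),
]
map_hyp = max(nbest, key=lambda item: item[1])[0]
mbr_hyp = mbr_decode(nbest)
```

In this toy list the highest-scoring hypothesis shares nothing with the others, so MBR prefers "a b c", which is supported by a similar neighbor. Lattice MBR applies the same decision rule to the far larger hypothesis set encoded in a lattice.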
Lattice-based Minimum Error Rate Training for Statistical Machine Translation
Franz Och
Ignacio Thayer
Jakob Uszkoreit
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 725-734
Minimum Error Rate Training (MERT) is an effective means to estimate the feature function weights of a linear model such that an automated evaluation criterion for measuring system performance can directly be optimized in training. To accomplish this, the training procedure determines for each feature function its exact error surface on a given set of candidate translations. The feature function weights are then adjusted by traversing the error surface combined over all sentences and picking those values for which the resulting error count reaches a minimum. Typically, candidates in MERT are represented as N-best lists which contain the N most probable translation hypotheses produced by a decoder. In this paper, we present a novel algorithm that allows for efficiently constructing and representing the exact error surface of all translations that are encoded in a phrase lattice. Compared to N-best MERT, the number of candidate translations thus taken into account increases by several orders of magnitude. The proposed method is used to train the feature function weights of a phrase-based statistical machine translation system. Experiments conducted on the NIST 2008 translation tasks show significant runtime improvements and moderate BLEU score gains over N-best MERT.
An Empirical Study on Computing Consensus Translations from Multiple Machine Translation Systems
Franz J. Och
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Association for Computational Linguistics, 209 N. Eighth Street, East Stroudsburg, PA, USA, pp. 986-995
This paper presents an empirical study on how different selections of input translation systems affect translation quality in system combination. We give empirical evidence that the systems to be combined should be of similar quality and need to be almost uncorrelated in order to be beneficial for system combination. Experimental results are presented for composite translations computed from large numbers of different research systems as well as a set of translation systems derived from one of the best-ranked machine translation engines in the 2006 NIST machine translation evaluation.
Improving Word Alignment with Bridge Languages
Franz Och
Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, 209 N. Eighth Street, East Stroudsburg, PA, USA (2007)
We describe an approach to improve Statistical Machine Translation (SMT) performance using multi-lingual, parallel, sentence-aligned corpora in several bridge languages. Our approach consists of a simple method for utilizing a bridge language to create a word alignment system and a procedure for combining word alignment systems from multiple bridge languages. The final translation is obtained by consensus decoding that combines hypotheses obtained using all bridge language word alignments. We present experiments showing that multilingual, parallel text in Spanish, French, Russian, and Chinese can be utilized in this framework to improve translation performance on an Arabic-to-English task.
Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition
Lars Haferkamp
Ralf Schlueter
Hermann Ney
Europ. Conf. on Speech Communication and Technology (2005), pp. 2133-2136
Minimum Exact Word Error Training
Ralf Schlueter
Hermann Ney
Automatic Speech Recognition and Understanding (2005), pp. 186-190
Discriminative Training with Tied Covariance Matrices
Ralf Schlueter
Hermann Ney
8th Int. Conf. on Spoken Language Processing (ICSLP) (2004), pp. 681-684
Adaptation in Statistical Pattern Recognition Using Tangent Vectors
Hermann Ney
Joerg Dahmen
IEEE Trans. Pattern Analysis Machine Intelligence, vol. 26 (2004), pp. 269-274
Probabilistic Aspects in Spoken Document Retrieval
Joerg Viechtbauer
Hermann Ney
EURASIP Journal on Applied Signal Processing (2003), pp. 115-127
A Comparative Study on Maximum Entropy and Discriminative Training for Acoustic Modeling in Automatic Speech Recognition
Hermann Ney
Proc. European Conference on Speech Communication and Technology (2003), pp. 493-496
Towards Automatic Corpus Preparation for a German Broadcast News Transcription System
Probabilistic Retrieval Based On Document Representations
Joerg Viechtbauer
Hermann Ney
Int. Conf. on Spoken Language Processing (2002), pp. 1481-1484
Comparison of Discriminative Training Criteria and Optimization Methods for Speech Recognition
Ralf Schlueter
Boris Mueller
Hermann Ney
Speech Communication, vol. 34 (2001), pp. 287-310
Learning of Variability for Invariant Statistical Pattern Recognition
Joerg Dahmen
Hermann Ney
European Conference on Machine Learning (ECML) (2001), pp. 263-275
Improving Automatic Speech Recognition Using Tangent Distance
Joerg Dahmen
Hermann Ney
European Conference on Speech Communication and Technology (2001)
A Combined Maximum Mutual Information and Maximum Likelihood Approach for Mixture Density Splitting
Ralf Schlueter
Boris Mueller
Hermann Ney
Europ. Conf. on Speech Communication and Technology (1999), pp. 1715-1718
Comparison of Discriminative Training Criteria
Ralf Schlueter
Int. Conf. on Acoustics, Speech, and Signal Processing (1998), pp. 493-496
Implementierung und Vergleich diskriminativer Verfahren fuer Spracherkennung bei kleinem Vokabular
RWTH Aachen University (1998), pp. 123
Comparison of Optimization Methods for Discriminative Training Criteria
Ralf Schlueter
Stephan Kanthak
Hermann Ney
Lutz Welling
Europ. Conf. on Speech Communication and Technology (1997), pp. 15-18