Sophisticated machine learning models have transformed the translation process for the better, yet the issue of gender bias in machine outputs is becoming an increasing cause for concern.
Renowned instant machine translation tool Google Translate has faced ongoing scrutiny for a lack of gender sensitivity when translating from a genderless language into a gendered language. For example, if in Finnish (a genderless language) <person A> invests and <person B> washes the laundry, Google Translate will automatically assign male and female pronouns: he invests and she washes the laundry.
While Google Translate produced this output, the machine translation system itself is not to blame. Instead, the problem lies in the artificial intelligence (AI) behind the machine translation: a machine learning model can only be as good as the data it learns from.
How does machine translation work?
Modern machine translation uses neural networks to automatically translate source text into a target language without human input. It learns to translate text based on bilingual corpora – training data – imported into machine translation models. The training data itself consists of millions of words and sentences previously translated by human translators. From this, a machine translation system learns how to apply and reproduce a language’s sentence structure, vocabulary, and syntax.
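As a drastically simplified illustration of learning from bilingual corpora, the idea can be sketched with raw co-occurrence counts over a tiny, hypothetical English–Spanish corpus. This is a statistical caricature, not how neural systems actually work:

```python
from collections import Counter, defaultdict

# Hypothetical toy parallel corpus; real systems train on millions of
# professionally translated sentence pairs. Articles are omitted for brevity.
parallel_corpus = [
    ("big house", "casa grande"),
    ("old house", "casa vieja"),
    ("big car", "coche grande"),
]

# Count how often each target-language word co-occurs with each source word.
cooccurrence = defaultdict(Counter)
for source, target in parallel_corpus:
    for s_word in source.split():
        for t_word in target.split():
            cooccurrence[s_word][t_word] += 1

def translate_word(word):
    # Pick the target word that most often appears alongside the source word.
    return cooccurrence[word].most_common(1)[0][0]

print(translate_word("house"))  # casa (co-occurs twice, vs once for others)
```

Neural machine translation replaces these raw counts with learned vector representations, but the underlying principle is the same: the output is whatever the training data statistically supports, which is precisely how biased data produces biased translations.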
The root cause of gender bias in machine learning models is skewed assumptions learned from the data. Human-produced data shapes how the algorithm assigns gender pronouns, and the machine translation output then replicates those assignments.
Errors in the algorithm
Biased datasets stem from the inputs of computational linguists. It is not an active or conscious effort by model creators to produce gender-biased translations, but rather a reflection of inherent social biases. As humans, we learn societal norms and values from our environment and its social forces. By way of this learning, human social cognition is inherently biased.
Gender stereotypes and insensitivities embedded in the human psyche manifest in the machine learning model. A UN study found that almost 90% of people, men and women alike, hold some form of bias against women. This is essentially why gender bias occurs in training data and why machine translation replicates it in its output.
Biases are automatic and naturally occurring, in that they help us make sense of complex realities. However, society is arguably becoming more aware of gender biases and their damaging impact on women and other gender minorities, and there is a greater understanding of the weight that gender pronouns carry. While unconscious bias cannot be undone, being aware of the problem and taking an active stance in disarming these associations is essential to improving training data.
Reducing gender bias in machine translation
There are no quick fixes to reducing gender bias in machine translation. It requires a long-term approach to slowly adjust datasets and teach machine learning models to become more gender sensitive.
So, how can you adjust datasets?
Modifying datasets: Datasets may be incomplete and/or skewed because a limited demographic produced the training corpora. Fixing the datasets themselves requires a broader demographic, with more women and people of other gender identities contributing, to achieve more inclusive and balanced outcomes.
This is, however, easier said than done, as datasets are made up of hundreds of millions of sentences. De-biasing the pre-existing data would be extremely time-consuming and costly, but the same concept can be applied on a smaller scale by retraining machine learning systems on new datasets. Over time, the AI will relearn based on the new datasets and produce more balanced translations.
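One way such retraining data can be balanced is counterfactual data augmentation: adding a gender-swapped copy of every gendered sentence so that both pronouns occur equally often. A minimal sketch, using hypothetical example sentences:

```python
# Naive pronoun-swap table. This is a simplification: English "her" can be
# possessive ("her car" -> "his car") or an object pronoun ("saw her" ->
# "saw him"), so real pipelines need more careful grammatical rules.
swap = {"he": "she", "she": "he", "his": "her", "her": "his"}

def augment(sentences):
    # Keep the originals, then append a gender-swapped copy of each
    # sentence that contains a gendered pronoun.
    augmented = list(sentences)
    for s in sentences:
        words = s.split()
        if any(w in swap for w in words):
            augmented.append(" ".join(swap.get(w, w) for w in words))
    return augmented

data = ["he is an architect", "he is an architect", "she does the laundry"]
balanced = augment(data)
# Each gendered sentence now has a counterpart with the opposite pronoun.
```

After augmentation, "architect" is no longer statistically tied to masculine pronouns in this toy dataset, which is exactly the effect rebalancing aims for at scale.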
Human-in-the-loop approach: In addition to retraining translation AI, adopting a human-in-the-loop approach can help to even out imbalances. A human-in-the-loop approach in this instance refers to a review of the translated material to reduce machine errors. As society becomes more sensitive to the importance of gender inclusiveness, the corpus will become fairer and more equal over time.
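A minimal sketch of what such a review step might look like: machine output containing gendered pronouns is flagged so a human reviewer can confirm or correct the gender choice before delivery. The sentence pairs below are hypothetical:

```python
# Gendered English pronouns that should trigger a human review.
GENDERED = {"he", "she", "his", "her", "him"}

def flag_for_review(translations):
    # Return the (source, output) pairs whose output contains a gendered
    # pronoun, so a human can verify the machine's gender choice.
    needs_review = []
    for source, mt_output in translations:
        if GENDERED & set(mt_output.lower().split()):
            needs_review.append((source, mt_output))
    return needs_review

batch = [
    ("hän sijoittaa", "he invests"),       # Finnish "hän" is gender-neutral
    ("iso talo", "the big house"),         # no pronoun, nothing to verify
]
flagged = flag_for_review(batch)  # only the first pair needs review
```

In a production workflow the flagged pairs would be routed to a reviewer, and the corrections could in turn be fed back into the training corpus.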
Gendered vs genderless languages
Translating from a language whose nouns are largely genderless, such as English, into gendered languages, such as French and Spanish, presents new challenges for de-biasing machine translation output. For instance, the sentence “the architect is talented” in English does not specify the gender of the architect. Google Translate offers two translations in Spanish, one in the feminine form and one in the masculine form.
It is impossible for machine translation to settle on a single, correct translation, because the English sentence does not specify whether the architect is male or female. In these instances, the machine translation relies on the training data to choose one translation: the output reflects the pronoun that statistically occurs more frequently within the training data. This is how human biases enter the data and why machine translation reflects the same bias. Statistically, architects are more often men than women, so the training data likely contains more instances where “architect” is associated with being male rather than female. The machine translation would automatically reproduce the most commonly occurring association in the dataset, even if the architect in the source text was in fact female.
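The frequency-based pronoun choice described above can be sketched directly. Assuming a hypothetical, skewed training sample in which "architect" appears three times with "he" and once with "she", a purely statistical system will always choose "he":

```python
from collections import Counter

# Hypothetical skewed training snippets, mirroring the bias described above.
training_sentences = [
    "he is an architect", "he is an architect", "he is an architect",
    "she is an architect",
]

def most_frequent_pronoun(noun, sentences):
    # Count which pronoun co-occurs with the noun, then pick the winner.
    counts = Counter()
    for s in sentences:
        if noun in s.split():
            for pronoun in ("he", "she"):
                if pronoun in s.split():
                    counts[pronoun] += 1
    return counts.most_common(1)[0][0]

print(most_frequent_pronoun("architect", training_sentences))  # he (3 vs 1)
```

The deciding factor is nothing but the 3-to-1 count, which is why rebalancing the data, rather than patching the model, is the durable fix.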
New and improved datasets
Gender biased outputs in machine translation can only be reduced or eliminated with the help of more balanced reference material. This requires ongoing and conscious efforts to include more diverse demographics producing the datasets for the machine translation system.