What causes bias in word embedding associations?
Word vectors are often criticized for capturing undesirable associations such as gender stereotypes. For example, according to the word embedding association test (WEAT), relative to artrelated terms, sciencerelated ones are significantly more associated with male attributes.
But what exactly is to blame? Biased training data, the embedding model itself, or just noise? Moreover, how effectively can we debias these embeddings? What theoretical guarantee is there, for example, that the subspace projection method for debiasing embeddings actually works?
In our ACL 2019 paper, “Understanding Undesirable Word Embedding Associations”, we answer some of these open questions. Specifically, we show that:

Debiasing skipgram and GloVe vectors using the subspace projection method (Bolukbasi et al., 2016) is, under certain conditions^{1}, equivalent to training on an unbiased corpus.

WEAT, the most common association test for word embeddings, can be easily “hacked” to claim that there is bias (i.e., a statistically significant association in a particular direction).

The relational inner product association (RIPA) is a much more robust alternative to WEAT.
Experiments with RIPA reveal that on average, skipgram does not make the vast majority of words any more gendered in the vector space than they are in the training corpus; individual words may be slightly more or less gendered due to noise. However, for words that are genderstereotyped (e.g., ‘nurse’) or genderspecific by definition (e.g., ‘queen’), skipgram actually amplifies the gender association in the corpus.
Provably Debiasing Embeddings
To debias word embeddings using the subspace projection method (Bolukbasi et al., 2016), we need to define a “bias subspace” in the embedding space and then subtract from each word vector its projection on this subspace. The inventors of this method created a bias subspace for gender by taking the first principal component of ten genderdefining relation vectors (e.g., $\vec{\textit{man}}  \vec{\textit{woman}}$).
To make any assertion about whether these debiased embeddings are actually unbiased, we need to first define what unbiasedness is. We do so by noting that although GloVe and skipgram learn vectors iteratively in practice, they are implicitly factorizing a wordcontext matrix containing a cooccurrence statistic^{2}.
Let $M$ denote the symmetric wordcontext matrix for a given training corpus that is implicitly factorized by the embedding model. Let $S$ denote a set of word pairs.
 A word $w$ is unbiased with respect to $S$ iff $\forall\, (x,y) \in S, M_{w,x} = M_{w,y}$.
 $M$ is unbiased with respect to $S$ iff $\forall\, w \not \in S$, $w$ is unbiased. A word $w$ or matrix $M$ is biased wrt $S$ iff it is not unbiased wrt $S$.
For example, the entire corpus would be unbiased with respect to {(‘male’, ‘female’)} iff $M_{\textit{w}, \text{male}}$ and $M_{\textit{w}, \text{female}}$ were interchangeable for any word $w$. Since $M$ is a wordcontext matrix, unbiasedness effectively means that the elements for ($w$, ‘male’) and ($w$, ‘female’) in $M$ can be switched without any impact on the word embeddings. Using this definition of unbiasedness, we prove the following:
Debiasing Theorem
For a set of word pairs $S$, let the bias subspace $B = \text{span}(\{ \vec{x}  \vec{y}\, \, (x,y) \in S\})$. For every word $w \not \in S$, let the debiased word vector . For any embedding model that implicitly factorizes $M$ into a word matrix $W$ and a context matrix $C$, the reconstructed wordcontext matrix $W_d C^T = M_d$ is unbiased with respect to $S$.
Lipstick on a Pig?
Although the subspace projection method is widely used for debiasing, Gonen and Goldberg (2019) observed that, in practice, gender can still be recovered from a “debiased” embedding space.
How can we reconcile this empirical observation with our theoretical findings above?

In practice, the bias subspace $B$ is defined as the first principal component of $\{ \vec{x}  \vec{y}\ \ (x,y) \in S\}$ instead of their span. The Debiasing Theorem is only applicable when $B$ is the span.

The word pairs $S$ are not exhaustive; they rarely contain all the directions in embedding space that capture gender. For example, if
and {(‘policeman’, ‘policewoman’)} $ \not\in S$, then the debiased word embeddings won’t be unbiased with respect to {(‘policeman’, ‘policewoman’)}. This would result in there being at least one direction in embedding space from which we could recover gender bias.
Though there are likely other factors at play, if only for these reasons, it is not surprising that what Gonen and Goldberg call the “lipstickonapig” problem persists in debiased embedding spaces.
Hacking WEAT
WEAT (Caliskan et al., 2016) is the most common test of bias in word embedding associations. In brief, it answers the following question: where relatedness is cosine similarity, are target words $T_1$ more related to attribute words $X$ than $Y$, relative to target words $T_2$?
For example, are sciencerelated terms more associated with male attributes and artrelated terms with female ones? According to the paper that proposed WEAT, yes, they are.
However, reasonable minds can disagree as to what exactly these ‘male’ and ‘female’ attribute sets should be. While it’s easy to construct many different – but equally valid – sets for each attribute, the choice of attribute set can lead to wildly different outcomes. We find that it is easy to contrive this selection to achieve a desired outcome, whether that is a statistically significant association in a particular direction (i.e., bias) or no significant association at all (i.e., no bias).
For example, is the word ‘door’ more male than femaleassociated, relative to the word ‘curtain’? According to WEAT, it depends on the attribute sets.
We use singleword sets above for the sake of simplicity: similar (statistically significant) outcomes can also be obtained with larger target and attribute word sets.^{3} Note that the problem here is not that some of these outcomes are unintuitive, but rather that the outcome can be easily manipulated by the person applying the test.
RIPA
We suggest using a more robust alternative to WEAT:
The relational inner product association (RIPA) of a word $w$ with respect to a relation vector $\vec{b}$ is $\beta(\vec{w}; \vec{b}) = \langle \vec{w}, \vec{b} \rangle$, where
 $S$ is a nonempty set of ordered word pairs $(x,y)$ that define the association
 $\vec{b}$ is the first principal component of $\{ \vec{x}  \vec{y}\ \ (x,y) \in S\}$
For example, if you wanted to calculate the genderedness of the word ‘door’, you would:
 Take the first principal component of $\vec{\textit{man}}  \vec{\textit{woman}}$, $\vec{\textit{king}}  \vec{\textit{queen}}$, etc. Call this $\vec{b}$.
 Calculate the dot product of $\vec{b}$ and $\vec{\textit{door}}$.
RIPA generalizes Bolukbasi et al.’s (2016) idea of measuring bias by projecting onto $\vec{\textit{he}}  \vec{\textit{she}}$ by replacing the difference vector with a relation vector $\vec{b}$. Using a relation vector – as opposed to, say, a higherdimensional subspace – makes RIPA interpretable: in the example above, positive RIPA values indicate male association and negative RIPA values indicate female association.
Why use RIPA?
RIPA is derived from the subspace projection method and is easy to calculate. Other than its simplicity, it has three advantages:

It is more robust to the choice of word pairs that define the association than WEAT is to its attribute word sets. Using (‘male’, ‘female’) instead of (‘man’, ‘woman’) to define the gender relation vector, for example, would have a negligible impact on $\vec{b}$. While RIPA is not completely immune to such changes, because of how $\vec{b}$ is defined, it is more robust than WEAT.

If only a single word pair $(x,y)$ defines the association, then the relation vector $\vec{b} = (\vec{x}  \vec{y}) /  \vec{x}  \vec{y} $, making RIPA highly interpretable. For example, for skipgram with negative sampling (SGNS), RIPA has the following statistical interpretation:
where csPMI refers to the cooccurrence shifted PMI^{4}, and $\lambda \approx 1, \alpha \approx 1$ in practice.

An obstacle to using any debiasing method is identifying which words we want to debias; we should debias ‘doctor’ but not ‘king’, for example, because the latter is genderspecific by definition. We propose a simple heuristic: debias a word $w$ iff
where $\vec{b_{\text{actual}}}$ is created using genderdefining pairs such as (‘man’, ‘woman’) and $\vec{b_{\text{stereo}}}$ is created using genderstereotypical pairs such as (‘doctor’, ‘midwife’).
Compared to Bolukbasi et al.’s (2016) supervised approach for finding which words to debias, our simple heuristic works much better.^{5} Note that we don’t change the debiasing method at all – we only change the selection of which words to debias.
Breaking Down Bias
Because RIPA has such a clear statistical interpretation for skipgram with negative sampling (SGNS), we can compare the genderedness in embedding space with the genderedness in the corpus to figure out $\Delta_g$, the change induced by the embedding model.
For example, if the relation vector $\vec{b}$ were defined only by the pair (‘man’, ‘woman’), then

the genderedness of word $w$ in the SGNS vector space would be

the genderedness of word $w$ in the corpus (a.k.a., RIPA in a noiseless SGNS space) would be:
We measure $\Delta_g$ as the change in absolute genderedness because some words may be more gendered in the embedding space than in the corpus, but in the opposite direction. In the paper, we estimate $\Delta_g$ with not just (‘man’, ‘woman’), but several other genderdefining word pairs.
As shown in red in the last row, genderneutral^{6} words, which comprise the vast majority of words in the vocabulary, are on average no more gendered in the embedding space than in the corpus. However, due to noise, individual words may be slightly more or less gendered in vector space.
In contrast, words that are genderstereotyped or genderspecific by definition are on average significantly more gendered in embedding space than in the training corpus. In the paper, we explore some theories for why this is the case.
Conclusion
Only words that are genderspecific (e.g, ‘queen’) or genderstereotyped (e.g., ‘nurse’) are on average more gendered in embedding space than in the training corpus; the vast majority of words are not. When measuring bias in word embedding associations, we recommend using RIPA instead of WEAT, as the latter’s outcomes can be easily contrived.
Once bias is identified, debiasing via subspace projection can, under certain conditions, be provably effective. However, these conditions are difficult to satisfy in practice, leading to some bias persisting in “debiased” embedding spaces. As NLP moves away from static embeddings to Transformerbased language models (e.g., BERT), leaving out training data responsible for undesirable associations (see Brunet et al., 2018) will likely be a more viable strategy than explicitly debiasing representations.
Acknowledgements
Many thanks to Krishnapriya Vishnubhotla, Jieyu Zhao, and Hila Gonen for their feedback on this blog post! This paper was coauthored with David Duvenaud and Graeme Hirst while I was at the University of Toronto.
Footnotes

These conditions are difficult to satisfy in practice, however, which allows some bias to persist in the debiased embedding space. This may partially explain the lipstickonapig problem (Gonen and Goldberg, 2019). ↩

For details about the implicit factorization done by skipgram, refer to Levy and Goldberg (2014). ↩

For example, {‘door’, ‘cat’, ‘shrub’} is significantly more femaleassociated than {‘curtain’, ‘dog’, ‘tree’} when the attribute words are {‘woman’, ‘womanly’, ‘female’} and {‘man’, ‘manly’, ‘male’} ($p < 0.001$). Larger word sets are associated with smaller effect sizes but still yield highly statistically significant WEAT outcomes. ↩

In brief, $\text{csPMI}(x,y) = \text{PMI}(x,y) + \log p(x,y)$. See my previous blog post for a more detailed discussion of the cooccurrence shifted PMI. ↩

For a fair comparison with the supervised approach, we use the methodology in Bolukbasi et al. (2016) regarding gendered word analogies for evaluating the usefulness of the RIPAbased heuristic. However, the focus of our paper is on bias in word embedding associations, not word analogies. ↩

We identified genderappropriate (e.g., ‘king’) and genderbiased (e.g, ‘nurse’) words by plucking them from the genderappropriate and genderbiased word analogies in Bolukbasi et al. (2016), which were constructed with the help of crowdworkers. We considered all words that did not fall into these categories to be genderneutral – neither genderspecific by definition nor associated with any gender stereotypes. See the paper for details. ↩