Masked LM Vs Causal LM
Created: 2022-08-21 11:24
#note
MLM: the task of predicting randomly masked tokens, so both the left and the right context must be used.
CLM: the task of predicting the next token, so the model is only concerned with the left context.
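The two objectives can be sketched by how each builds (input, target) pairs from a token sequence. This is a minimal illustration, not a real preprocessing pipeline: the masked position is fixed here for clarity, whereas in practice ~15% of positions are chosen at random (the BERT convention).

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# MLM: hide a token (position 2 here, chosen for illustration) and
# predict it from BOTH the left and the right context.
masked_pos = 2
mlm_input = list(tokens)
mlm_input[masked_pos] = "[MASK]"
mlm_target = tokens[masked_pos]   # "sat"

# CLM: every position predicts the NEXT token, so the target sequence
# is just the input shifted left by one; only left context is visible.
clm_input = tokens[:-1]           # ["the", "cat", "sat", "on", "the"]
clm_target = tokens[1:]           # ["cat", "sat", "on", "the", "mat"]

print(mlm_input)
print(clm_target)
```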
An MLM loss is preferred when the goal is to learn a good representation of the input document, whereas a CLM loss is preferred when we wish to learn a system that generates fluent text. Intuitively this makes sense: to learn a good representation of a word, you want to see the words on both its left and its right, whereas a system that generates text can only condition on what it has generated so far (much like how humans write). Letting the model peek at the right-hand context while generating would introduce a bias, since that information is unavailable at generation time, and would limit the model's generative ability.
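The "can only see the left context" property is usually enforced with an attention mask. A minimal sketch of the two mask patterns (illustrative only; real models build these as tensors and add them to attention scores):

```python
n = 4  # sequence length

# MLM (bidirectional): every position may attend to every other position.
bidir_mask = [[1] * n for _ in range(n)]

# CLM (causal): position i may only attend to positions j <= i,
# which is exactly what prevents "peeking" at future tokens.
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask:
    print(row)
```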
References
Tags
#transformer