Understanding and Mitigating Memorization in Language Models
Mansi Sakarvadia, University of Chicago
Language models (LMs) are able to "memorize" information: they encode training data verbatim in their weights, such that the LM can regurgitate that data at inference time. Memorization reflects a model's failure to learn general patterns from its training data, and it is a problem because it leaves training data vulnerable to extraction at inference time. This is especially problematic when private, sensitive, or incorrect training data can be extracted by a malicious third party. In this work, we investigate methods to mitigate memorization at training time, as well as methods to localize and remove memorized information from LM weights after training. We find that train-time mitigations are effective at curbing memorization, but at the cost of a slight decrease in accuracy on target tasks. We find that post-training localization and removal of memorized information is a more effective measure to curb memorization, as we are able to directly excise a small number of weights (<1%) while preserving model accuracy.
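To make the localization-and-removal idea concrete, here is a minimal illustrative sketch, not the specific method used in this work: it assumes a Hugging Face causal LM, uses a simple weight-times-gradient attribution on a single memorized sequence to score weights, and zeroes out the top 1% of weights by that score. The model name, the memorized string, and the scoring rule are all placeholder assumptions.

```python
# Illustrative sketch only: localize weights implicated in reproducing a
# memorized sequence via a simple attribution score, then zero them out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not necessarily the one studied
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

memorized_text = "example sequence the model reproduces verbatim"  # hypothetical
inputs = tokenizer(memorized_text, return_tensors="pt")

# Compute the LM loss on the memorized sequence (labels = its own tokens).
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()

# Score every weight by |weight * gradient| as a rough attribution signal.
scores = torch.cat([
    (p * p.grad).abs().flatten()
    for p in model.parameters() if p.grad is not None
])

# Excise the top 1% highest-scoring weights by setting them to zero.
k = int(0.01 * scores.numel())
threshold = torch.topk(scores, k).values.min()

with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p[(p * p.grad).abs() >= threshold] = 0.0
```

In practice one would verify, after excision, that the memorized sequence can no longer be regurgitated and that accuracy on held-out target tasks is preserved.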