BERT MLM Masking — TorchedUp

Masked Language Modeling Mask

Implement BERT's MLM masking procedure.

Signature: def apply_mlm_mask(token_ids: list, mask_id: int, vocab_size: int, mask_prob: float = 0.15, seed: int = 0) -> tuple

Returns (masked_tokens, labels) where both are lists of length len(token_ids).

Procedure:

1. Call np.random.seed(seed) and let n = len(token_ids).
2. Draw select = np.random.rand(n). Token i is selected for prediction if select[i] < mask_prob.
3. Draw op = np.random.rand(n). For each selected token:
- op[i] < 0.8 → replace with mask_id
- 0.8 <= op[i] < 0.9 → replace with a random token from rand_tok = np.random.randint(0, vocab_size, size=n)
- op[i] >= 0.9 → leave the token unchanged
4. Set labels[i] = token_ids[i] for selected positions and -100 otherwise. Unselected tokens are copied to masked_tokens unchanged.

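Draw the three random arrays in this order: select, then op, then rand_tok. Call np.random.randint(0, vocab_size, size=n) exactly once, up front, immediately after the two np.random.rand calls and before any token is modified, so the random stream is consumed in a fixed order.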
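
The procedure above can be sketched in Python as follows. This is one possible reading, not a reference solution. In particular, the statement does not say which element of rand_tok replaces a given token; the sketch assumes position i uses rand_tok[i]. The IGNORE_INDEX constant is a name introduced here for illustration.

```python
import numpy as np

IGNORE_INDEX = -100  # label for positions that are not selected (name is illustrative)


def apply_mlm_mask(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    n = len(token_ids)
    np.random.seed(seed)

    # Draw order is fixed so results are reproducible: select, op, rand_tok.
    select = np.random.rand(n)
    op = np.random.rand(n)
    rand_tok = np.random.randint(0, vocab_size, size=n)

    masked_tokens = list(token_ids)  # copy, so the input list is not mutated
    labels = [IGNORE_INDEX] * n

    for i in range(n):
        if select[i] >= mask_prob:
            continue  # not selected: token and label stay as initialised
        labels[i] = token_ids[i]
        if op[i] < 0.8:
            masked_tokens[i] = mask_id
        elif op[i] < 0.9:
            # Assumption: position i takes the i-th pre-drawn random token.
            masked_tokens[i] = int(rand_tok[i])
        # op[i] >= 0.9: keep the original token

    return masked_tokens, labels
```

Two edge cases follow directly from the selection rule: with mask_prob=0.0 no position is selected, and with mask_prob=1.0 every position is, because np.random.rand returns values in [0, 1).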

