GeLU (Gaussian Error Linear Unit) is the activation function used in BERT, GPT-2, and most modern transformers. Unlike ReLU, which hard-gates inputs at zero, GeLU weights inputs by the probability that they are positive under a standard Gaussian distribution.
Exact form: GeLU(x) = x * Φ(x), where Φ is the standard Gaussian CDF. In terms of the error function, Φ(x) = 0.5 * (1 + erf(x / sqrt(2))).
Since computing Φ exactly is expensive, PyTorch also provides a tanh approximation:
GeLU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))
Implement both forms. When approximate=False (default), use the exact form via scipy.special.erf. When approximate=True, use the tanh approximation.
Signature: def gelu(x: np.ndarray, approximate: bool = False) -> np.ndarray
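One possible reference implementation, as a sketch assuming NumPy and SciPy are available (the grading harness and tolerance are not specified here):

    import numpy as np
    from scipy.special import erf

    def gelu(x: np.ndarray, approximate: bool = False) -> np.ndarray:
        """GeLU activation: x * Phi(x), with an optional tanh approximation."""
        if approximate:
            # Tanh approximation:
            # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
            inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
            return 0.5 * x * (1.0 + np.tanh(inner))
        # Exact form: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
        return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

Example usage (the two forms should agree to within roughly 1e-3 elementwise):

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(gelu(x))                    # exact, erf-based
    print(gelu(x, approximate=True))  # tanh approximation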