GeLU (Gaussian Error Linear Unit) is the activation function used in BERT, GPT-2, and most modern transformers. Unlike ReLU, which hard-gates inputs at zero, GeLU weights inputs by the probability that they are positive under a standard Gaussian distribution.
Exact form: GeLU(x) = x * Φ(x), where Φ is the standard Gaussian CDF. In terms of the error function, Φ(x) = 0.5 * (1 + erf(x / sqrt(2))).
Since computing Φ exactly is expensive, PyTorch also provides a tanh approximation:
GeLU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))
Implement both forms. When approximate=False (default), use the exact form via scipy.special.erf. When approximate=True, use the tanh approximation.
Signature: def gelu(x: np.ndarray, approximate: bool = False) -> np.ndarray
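One possible reference implementation, as a sketch assuming NumPy and SciPy are available (the grading harness and tolerance are not specified here):

    import numpy as np
    from scipy.special import erf

    def gelu(x: np.ndarray, approximate: bool = False) -> np.ndarray:
        """GeLU activation: x * Phi(x), with an optional tanh approximation."""
        if approximate:
            # Tanh approximation:
            # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
            inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
            return 0.5 * x * (1.0 + np.tanh(inner))
        # Exact form: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
        return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

Example usage (the two forms should agree to within roughly 1e-3 elementwise):

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(gelu(x))                    # exact, erf-based
    print(gelu(x, approximate=True))  # tanh approximation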