API
transformers_lab.feed_forward
Feed-forward neural network module.
FeedForward
Position-wise feed-forward network used in the Transformer.
__init__(d_model, d_ff, init_weight_fn=xavier_init)
Initialize the feed-forward module.
Parameters
d_model : int Model dimensionality
d_ff : int Hidden layer dimensionality
init_weight_fn : Callable[[int, int], np.ndarray] Weight initialization function. Defaults to xavier_init.
forward(x)
Run forward pass.
Parameters
x : np.ndarray Shape (seq_len, d_model)
Returns:
np.ndarray Shape (seq_len, d_model)
relu(x)
ReLU activation function.
xavier_init(n1, n2, n_heads=None)
Xavier Glorot initialization for weight matrices.
Parameters
n1 : int Input dimension.
n2 : int Output dimension.
n_heads : int, optional If provided, returns shape (n_heads, n1, n2). If None, returns shape (n1, n2).
Returns:
np.ndarray Shape (n_heads, n1, n2) or (n1, n2).
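Based on the shapes documented above, the module can be sketched as follows. This is a hypothetical minimal implementation inferred from the docstrings (the actual module may add biases or differ in details); the uniform-limit formula is the standard Glorot/Xavier uniform rule.

```python
import numpy as np

def xavier_init(n1, n2, n_heads=None):
    # Glorot/Xavier uniform init: limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (n1 + n2))
    shape = (n1, n2) if n_heads is None else (n_heads, n1, n2)
    return np.random.uniform(-limit, limit, size=shape)

def relu(x):
    # Element-wise max(0, x).
    return np.maximum(0.0, x)

class FeedForward:
    # Hypothetical minimal version of the documented interface (no biases).
    def __init__(self, d_model, d_ff, init_weight_fn=xavier_init):
        self.w1 = init_weight_fn(d_model, d_ff)  # (d_model, d_ff)
        self.w2 = init_weight_fn(d_ff, d_model)  # (d_ff, d_model)

    def forward(self, x):
        # Position-wise: the same transform is applied independently
        # at every sequence position.
        return relu(x @ self.w1) @ self.w2  # (seq_len, d_model)

ff = FeedForward(d_model=8, d_ff=32)
out = ff.forward(np.zeros((5, 8)))
print(out.shape)  # (5, 8)
```

Note how the output shape matches the input shape, which is what allows the residual connection around the FFN sublayer.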
transformers_lab.layer_norm
Layer normalization implementation.
LayerNorm
Layer normalization implementation.
__call__(x)
Apply layer normalization.
Parameters
x : np.ndarray Input tensor of shape (..., d_model).
Returns:
np.ndarray Normalized tensor with the same shape as input.
__init__(d_model, eps=1e-06)
Initialize the LayerNorm module.
Parameters
d_model : int Dimensionality of the input
eps : float, optional Value added to the denominator for numerical stability. Default is 1e-6.
Attributes:
gamma : np.ndarray Scale parameter of shape (d_model,)
beta : np.ndarray Shift parameter of shape (d_model,)
eps : float Numerical stability constant
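Given the attributes and shapes above, a minimal sketch might look like this (assuming normalization over the last axis, which the `(..., d_model)` shape suggests; the real module may differ in details):

```python
import numpy as np

class LayerNorm:
    # Hypothetical minimal version of the documented interface.
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)   # scale, shape (d_model,)
        self.beta = np.zeros(d_model)   # shift, shape (d_model,)
        self.eps = eps                  # numerical stability constant

    def __call__(self, x):
        # Normalize each position over the last axis (d_model),
        # then apply the learnable scale and shift.
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

ln = LayerNorm(4)
y = ln(np.array([[1.0, 2.0, 3.0, 4.0]]))
print(y.mean(axis=-1))  # approximately zero per position
```

With the default `gamma = 1` and `beta = 0`, each position comes out with (approximately) zero mean and unit variance.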
transformers_lab.multihead_attention
Compute multi-head attention.
multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, x_cross=None, mask=None)
Computes multi-head attention.
Parameters
x : np.ndarray Input tensor of shape (seq_len, d_model)
w_q : np.ndarray Query weights of shape (n_heads, d_model, d_k)
w_k : np.ndarray Key weights of shape (n_heads, d_model, d_k)
w_v : np.ndarray Value weights of shape (n_heads, d_model, d_k)
w_o : np.ndarray Output projection matrix of shape (d_model, d_model)
n_heads : int Number of attention heads
x_cross : np.ndarray, optional Encoder output of shape (src_seq_len, d_model). If provided, keys and values come from x_cross (cross-attention). If None, keys and values come from x (self-attention).
mask : np.ndarray, optional Mask of shape (seq_len, seq_len). Use make_causal_mask() for masked self-attention in the decoder.
Returns:
np.ndarray Output tensor of shape (seq_len, d_model)
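A sketch consistent with the documented shapes, including the self- vs. cross-attention switch on `x_cross`. Two assumptions are not stated in the docs: the mask is treated as a boolean keep-mask (masked positions get a large negative score), and `d_model == n_heads * d_k` so the concatenated heads match `w_o`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax (see transformers_lab.softmax).
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, x_cross=None, mask=None):
    # Keys and values come from x_cross in cross-attention, else from x.
    kv_src = x if x_cross is None else x_cross
    d_k = w_q.shape[-1]
    heads = []
    for h in range(n_heads):
        q, k, v = x @ w_q[h], kv_src @ w_k[h], kv_src @ w_v[h]
        scores = q @ k.T / np.sqrt(d_k)            # (seq_len, kv_seq_len)
        if mask is not None:
            scores = np.where(mask, scores, -1e9)  # assumed boolean keep-mask
        heads.append(softmax(scores) @ v)          # (seq_len, d_k)
    # Concatenating heads assumes d_model == n_heads * d_k.
    return np.concatenate(heads, axis=-1) @ w_o    # (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, n_heads, d_k, seq_len = 8, 2, 4, 3
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))
w_o = rng.normal(size=(d_model, d_model))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)  # (3, 8)
```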
scaled_dot_product_attention(q, k, v, mask=None)
Compute scaled dot-product attention.
Parameters
q : np.ndarray Query matrix of shape (seq_len, d_k)
k : np.ndarray Key matrix of shape (seq_len, d_k)
v : np.ndarray Value matrix of shape (seq_len, d_k)
mask : np.ndarray, optional Mask to apply during attention computation. Must be of shape (seq_len, seq_len)
Returns:
np.ndarray Output matrix of shape (seq_len, d_k)
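The formula being computed is `Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V`. A sketch under the assumption that the mask is boolean with True meaning "attend" (the library's actual mask convention may differ):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # assumed boolean keep-mask
    # Inline numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # (seq_len, d_k)

q = k = v = np.eye(2)
out = scaled_dot_product_attention(q, k, v)
print(out.sum(axis=-1))  # rows sum to 1 because v is the identity

# With a lower-triangular (causal) mask, position 0 can only attend to itself:
causal = np.tril(np.ones((2, 2), dtype=bool))
out_masked = scaled_dot_product_attention(q, k, v, mask=causal)
print(out_masked[0])  # [1. 0.]
```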
transformers_lab.positional_encoding
Positional encoding.
sinusoidal_positional_encoding(seq_len, d_model)
Compute sinusoidal positional encoding.
Parameters
seq_len : int Length of the input sequence
d_model : int Dimensionality of the model
Returns:
np.ndarray Positional encoding matrix of shape (seq_len, d_model)
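The standard sinusoidal scheme assigns `sin` to even feature indices and `cos` to odd ones, with geometrically spaced frequencies. A sketch, assuming `d_model` is even as in the original Transformer paper:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    # Assumes d_model is even.
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even indices
    pe[:, 1::2] = np.cos(angles)               # odd indices
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
print(pe.shape)  # (4, 6)
print(pe[0])     # position 0: sin(0)=0 and cos(0)=1 alternate
```

Because the matrix has shape (seq_len, d_model), it can be added directly to the token embeddings.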
transformers_lab.self_attention
Compute self-attention.
scaled_dot_product_attention(q, k, v, mask=None)
Compute scaled dot-product attention.
Parameters
q : np.ndarray Query matrix of shape (seq_len, d_k)
k : np.ndarray Key matrix of shape (seq_len, d_k)
v : np.ndarray Value matrix of shape (seq_len, d_k)
mask : np.ndarray, optional Mask to apply during attention computation. Must be of shape (seq_len, seq_len)
Returns:
np.ndarray Output matrix of shape (seq_len, d_k)
self_attention(x, w_q, w_k, w_v, mask=None)
Compute single-head self-attention.
Parameters
x : np.ndarray Input embeddings of shape (seq_len, d_model)
w_q : np.ndarray Query projection matrix of shape (d_model, d_k)
w_k : np.ndarray Key projection matrix of shape (d_model, d_k)
w_v : np.ndarray Value projection matrix of shape (d_model, d_k)
mask : np.ndarray, optional Mask of shape (seq_len, seq_len). Defaults to None.
Returns:
np.ndarray Output of shape (seq_len, d_k)
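Single-head self-attention is just the three projections followed by scaled dot-product attention. A sketch consistent with the documented shapes (the inlined attention helper assumes a boolean keep-mask, which is not stated in the docs):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    # softmax(Q K^T / sqrt(d_k)) V, with a stable softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # assumed boolean keep-mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def self_attention(x, w_q, w_k, w_v, mask=None):
    # Project the input into query, key, and value spaces, then attend.
    return scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v, mask=mask)

rng = np.random.default_rng(1)
seq_len, d_model, d_k = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 4)
```

Note the output lives in the d_k space, not d_model; the multi-head version recovers the model dimension via the output projection `w_o`.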
transformers_lab.softmax
Softmax is used to:
- transform arbitrary real-valued scores into probabilities
- ensure all values are positive
- ensure the probabilities sum to 1
- normalize scores along a specified axis
In Transformers, softmax is used to convert similarity scores (attention scores) into attention weights.
softmax(x, axis=-1)
Compute the softmax function in a numerically stable way.
The softmax function transforms arbitrary real-valued scores into positive probabilities that sum to 1 along the specified axis.
In Transformers, it is commonly used to convert attention similarity scores into attention weights.
Parameters
x : np.ndarray Input scores
axis : int, optional Axis along which the normalization is applied. Defaults to the last axis.
Returns:
np.ndarray Array of the same shape as x containing the normalized probabilities
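The "numerically stable" part refers to subtracting the per-axis maximum before exponentiating, which leaves the result unchanged (softmax is shift-invariant) while preventing overflow in `np.exp`. A sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the max does not change the result (softmax is
    # shift-invariant) but keeps np.exp from overflowing on large scores.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

p = softmax(np.array([1.0, 2.0, 3.0]))
# p is approximately [0.0900, 0.2447, 0.6652]; all positive, sums to 1.
print(p.sum())
```

Without the max subtraction, an input like `np.array([1000.0, 1001.0])` would overflow to `inf`; with it, the same call returns finite probabilities.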