API
transformers_lab.feed_forward
Feed-forward neural network module.
FeedForward
Position-wise feed-forward network used in the Transformer.
__init__(d_model, d_ff, init_weight_fn=xavier_init)
Initialize the feed-forward module.
Parameters
d_model : int Model dimensionality
d_ff : int Hidden layer dimensionality
init_weight_fn : Callable[[int, int], np.ndarray] Weight initialization function. Defaults to xavier_init.
forward(x)
Run forward pass.
Parameters
x : np.ndarray Shape (seq_len, d_model)
Returns:
np.ndarray Shape (seq_len, d_model)
relu(x)
ReLU activation function.
xavier_init(n1, n2, n_heads=None)
Xavier Glorot initialization for weight matrices.
Parameters
n1 : int Input dimension.
n2 : int Output dimension.
n_heads : int, optional If provided, returns shape (n_heads, n1, n2). If None, returns shape (n1, n2).
Returns:
np.ndarray Shape (n_heads, n1, n2) or (n1, n2).
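Based on the shapes documented above, the module can be sketched as follows. This is a hypothetical minimal implementation inferred from the docstrings (the actual module may add biases or differ in details); the uniform-limit formula is the standard Glorot/Xavier uniform rule.

```python
import numpy as np

def xavier_init(n1, n2, n_heads=None):
    # Glorot/Xavier uniform init: limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (n1 + n2))
    shape = (n1, n2) if n_heads is None else (n_heads, n1, n2)
    return np.random.uniform(-limit, limit, size=shape)

def relu(x):
    # Element-wise max(0, x).
    return np.maximum(0.0, x)

class FeedForward:
    # Hypothetical minimal version of the documented interface (no biases).
    def __init__(self, d_model, d_ff, init_weight_fn=xavier_init):
        self.w1 = init_weight_fn(d_model, d_ff)  # (d_model, d_ff)
        self.w2 = init_weight_fn(d_ff, d_model)  # (d_ff, d_model)

    def forward(self, x):
        # Position-wise: the same transform is applied independently
        # at every sequence position.
        return relu(x @ self.w1) @ self.w2  # (seq_len, d_model)

ff = FeedForward(d_model=8, d_ff=32)
out = ff.forward(np.zeros((5, 8)))
print(out.shape)  # (5, 8)
```

Note how the output shape matches the input shape, which is what allows the residual connection around the FFN sublayer.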
transformers_lab.layer_norm
Layer normalization implementation.
LayerNorm
Layer normalization implementation.
__call__(x)
Apply layer normalization.
Parameters
x : np.ndarray Input tensor of shape (..., d_model).
Returns:
np.ndarray Normalized tensor with the same shape as input.
__init__(d_model, eps=1e-06)
Initialize the LayerNorm module.
Parameters
d_model : int Dimensionality of the input
eps : float, optional Value added to the denominator for numerical stability. Default is 1e-6.
Attributes:
gamma : np.ndarray Scale parameter of shape (d_model,)
beta : np.ndarray Shift parameter of shape (d_model,)
eps : float Numerical stability constant
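Given the attributes and shapes above, a minimal sketch might look like this (assuming normalization over the last axis, which the `(..., d_model)` shape suggests; the real module may differ in details):

```python
import numpy as np

class LayerNorm:
    # Hypothetical minimal version of the documented interface.
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)   # scale, shape (d_model,)
        self.beta = np.zeros(d_model)   # shift, shape (d_model,)
        self.eps = eps                  # numerical stability constant

    def __call__(self, x):
        # Normalize each position over the last axis (d_model),
        # then apply the learnable scale and shift.
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

ln = LayerNorm(4)
y = ln(np.array([[1.0, 2.0, 3.0, 4.0]]))
print(y.mean(axis=-1))  # approximately zero per position
```

With the default `gamma = 1` and `beta = 0`, each position comes out with (approximately) zero mean and unit variance.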
transformers_lab.multihead_attention
Compute multi-head attention.
multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, x_cross=None, mask=None)
Computes multi-head attention.
Parameters
x : np.ndarray Input tensor of shape (seq_len, d_model)
w_q : np.ndarray Query weights of shape (n_heads, d_model, d_k)
w_k : np.ndarray Key weights of shape (n_heads, d_model, d_k)
w_v : np.ndarray Value weights of shape (n_heads, d_model, d_k)
w_o : np.ndarray Output projection matrix of shape (d_model, d_model)
n_heads : int Number of attention heads
x_cross : np.ndarray, optional Encoder output of shape (src_seq_len, d_model). If provided, keys and values come from x_cross (cross-attention). If None, keys and values come from x (self-attention).
mask : np.ndarray, optional Mask of shape (seq_len, seq_len). Use make_causal_mask() for masked self-attention in the decoder.
Returns:
np.ndarray Output tensor of shape (seq_len, d_model)
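A sketch consistent with the documented shapes, including the self- vs. cross-attention switch on `x_cross`. Two assumptions are not stated in the docs: the mask is treated as a boolean keep-mask (masked positions get a large negative score), and `d_model == n_heads * d_k` so the concatenated heads match `w_o`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax (see transformers_lab.softmax).
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, x_cross=None, mask=None):
    # Keys and values come from x_cross in cross-attention, else from x.
    kv_src = x if x_cross is None else x_cross
    d_k = w_q.shape[-1]
    heads = []
    for h in range(n_heads):
        q, k, v = x @ w_q[h], kv_src @ w_k[h], kv_src @ w_v[h]
        scores = q @ k.T / np.sqrt(d_k)            # (seq_len, kv_seq_len)
        if mask is not None:
            scores = np.where(mask, scores, -1e9)  # assumed boolean keep-mask
        heads.append(softmax(scores) @ v)          # (seq_len, d_k)
    # Concatenating heads assumes d_model == n_heads * d_k.
    return np.concatenate(heads, axis=-1) @ w_o    # (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, n_heads, d_k, seq_len = 8, 2, 4, 3
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))
w_o = rng.normal(size=(d_model, d_model))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)  # (3, 8)
```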
scaled_dot_product_attention(q, k, v, mask=None)
Compute scaled dot-product attention.
Parameters
q : np.ndarray Query matrix of shape (seq_len, d_k)
k : np.ndarray Key matrix of shape (seq_len, d_k)
v : np.ndarray Value matrix of shape (seq_len, d_k)
mask : np.ndarray, optional Mask to apply during attention computation. Must be of shape (seq_len, seq_len)
Returns:
np.ndarray Output matrix of shape (seq_len, d_k)
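The formula being computed is `Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V`. A sketch under the assumption that the mask is boolean with True meaning "attend" (the library's actual mask convention may differ):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # assumed boolean keep-mask
    # Inline numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # (seq_len, d_k)

q = k = v = np.eye(2)
out = scaled_dot_product_attention(q, k, v)
print(out.sum(axis=-1))  # rows sum to 1 because v is the identity

# With a lower-triangular (causal) mask, position 0 can only attend to itself:
causal = np.tril(np.ones((2, 2), dtype=bool))
out_masked = scaled_dot_product_attention(q, k, v, mask=causal)
print(out_masked[0])  # [1. 0.]
```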
transformers_lab.positional_encoding
Positional encoding.
sinusoidal_positional_encoding(seq_len, d_model)
Compute sinusoidal positional encoding.
Parameters
seq_len : int Length of the input sequence
d_model : int Dimensionality of the model
Returns:
np.ndarray Positional encoding matrix of shape (seq_len, d_model)
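The standard sinusoidal scheme assigns `sin` to even feature indices and `cos` to odd ones, with geometrically spaced frequencies. A sketch, assuming `d_model` is even as in the original Transformer paper:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    # Assumes d_model is even.
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even indices
    pe[:, 1::2] = np.cos(angles)               # odd indices
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
print(pe.shape)  # (4, 6)
print(pe[0])     # position 0: sin(0)=0 and cos(0)=1 alternate
```

Because the matrix has shape (seq_len, d_model), it can be added directly to the token embeddings.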
transformers_lab.self_attention
Compute self-attention.
scaled_dot_product_attention(q, k, v, mask=None)
Compute scaled dot-product attention.
Parameters
q : np.ndarray Query matrix of shape (seq_len, d_k)
k : np.ndarray Key matrix of shape (seq_len, d_k)
v : np.ndarray Value matrix of shape (seq_len, d_k)
mask : np.ndarray, optional Mask to apply during attention computation. Must be of shape (seq_len, seq_len)
Returns:
np.ndarray Output matrix of shape (seq_len, d_k)
self_attention(x, w_q, w_k, w_v, mask=None)
Compute single-head self-attention.
Parameters
x : np.ndarray Input embeddings of shape (seq_len, d_model)
w_q : np.ndarray Query projection matrix of shape (d_model, d_k)
w_k : np.ndarray Key projection matrix of shape (d_model, d_k)
w_v : np.ndarray Value projection matrix of shape (d_model, d_k)
mask : np.ndarray, optional Mask of shape (seq_len, seq_len). Defaults to None.
Returns:
np.ndarray Output of shape (seq_len, d_k)
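Single-head self-attention is just the three projections followed by scaled dot-product attention. A sketch consistent with the documented shapes (the inlined attention helper assumes a boolean keep-mask, which is not stated in the docs):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    # softmax(Q K^T / sqrt(d_k)) V, with a stable softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # assumed boolean keep-mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def self_attention(x, w_q, w_k, w_v, mask=None):
    # Project the input into query, key, and value spaces, then attend.
    return scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v, mask=mask)

rng = np.random.default_rng(1)
seq_len, d_model, d_k = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 4)
```

Note the output lives in the d_k space, not d_model; the multi-head version recovers the model dimension via the output projection `w_o`.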
transformers_lab.softmax
Softmax is used to:
- transform arbitrary real-valued scores into probabilities
- ensure all values are positive
- ensure the probabilities sum to 1
- normalize scores along a specified axis
In Transformers, softmax is used to convert similarity scores (attention scores) into attention weights.
softmax(x, axis=-1)
Compute the softmax function in a numerically stable way.
The softmax function transforms arbitrary real-valued scores into positive probabilities that sum to 1 along the specified axis.
In Transformers, it is commonly used to convert attention similarity scores into attention weights.
Parameters
x : np.ndarray Input scores
axis : int, optional Axis along which the normalization is applied. Defaults to the last axis.
Returns:
np.ndarray Array of the same shape as x containing the normalized probabilities
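The "numerically stable" part refers to subtracting the per-axis maximum before exponentiating, which leaves the result unchanged (softmax is shift-invariant) while preventing overflow in `np.exp`. A sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the max does not change the result (softmax is
    # shift-invariant) but keeps np.exp from overflowing on large scores.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

p = softmax(np.array([1.0, 2.0, 3.0]))
# p is approximately [0.0900, 0.2447, 0.6652]; all positive, sums to 1.
print(p.sum())
```

Without the max subtraction, an input like `np.array([1000.0, 1001.0])` would overflow to `inf`; with it, the same call returns finite probabilities.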