Compute a Multi-Head, Self-Attention Matrix by computing attention for each head, concatenating the per-head results column-wise, and finally multiplying by the overall weight matrix w_o. The operator ++^ concatenates matrices column-wise.
Value parameters
k
the key matrix K
q
the query matrix Q (q_t over all time)
v
the value matrix V
w_o
the overall weight matrix to be applied to concatenated attention
w_q
the weight tensor for query Q (w_q(i) matrix for i-th head)
w_v
the weight tensor for value V (w_v(i) matrix for i-th head)
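The computation described above (per-head attention, column-wise concatenation via ++^, then multiplication by w_o) can be sketched as follows. This is a minimal NumPy illustration, not the library's implementation; note that the signature above has no key weight tensor, so the `w_k` projection here is a hypothetical addition to make the shapes line up (in the library, the keys may be projected elsewhere).

```python
import numpy as np

def softmax(x):
    # row-wise softmax over attention scores
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o):
    # q, k, v:           (T, d_model) matrices over all time steps
    # w_q, w_k, w_v:     (n_heads, d_model, d_head) weight tensors; w_q[i] is
    #                    the matrix for the i-th head (w_k is hypothetical here)
    # w_o:               (n_heads * d_head, d_model) overall weight matrix
    heads = []
    for wq_i, wk_i, wv_i in zip(w_q, w_k, w_v):
        qi, ki, vi = q @ wq_i, k @ wk_i, v @ wv_i
        scores = softmax(qi @ ki.T / np.sqrt(qi.shape[-1]))
        heads.append(scores @ vi)                  # attention for one head
    return np.concatenate(heads, axis=1) @ w_o     # ++^ then apply w_o
```

The result has the same shape as the inputs, (T, d_model), since concatenating n_heads matrices of width d_head gives width n_heads * d_head, which w_o maps back to d_model.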