Generalized End-to-End (GE2E) loss is a loss function often used in speaker verification and speaker embedding tasks. It was introduced by Google researchers in "Generalized End-to-End Loss for Speaker Verification" (Wan et al., 2018), and it helps models learn discriminative speaker embeddings that cluster tightly for the same speaker while staying well separated from the embeddings of other speakers.
It is useful because it is computed over an entire batch of utterances at once, encourages tighter clustering of each speaker's embeddings, and improves discrimination between speakers.
How It Works
It operates by:
- Comparing speaker embeddings within a batch (see the batch-layout sketch after this list)
- Encouraging embeddings of the same speaker to align closely (intra-class compactness)
- Forcing embeddings of different speakers to diverge (inter-class separation)
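To make the batch layout concrete, here is a minimal sketch in PyTorch. The sizes are arbitrary placeholders and the random tensor merely stands in for real encoder output; it only fixes the shape convention used in the rest of this section: N speakers, M utterances per speaker, D-dimensional embeddings.

```python
import torch
import torch.nn.functional as F

# Placeholder sizes: N speakers, M utterances per speaker, D-dimensional embeddings.
N, M, D = 4, 5, 256

# In a real system these come from the speaker encoder; random values here
# just stand in for encoder output so the shapes are concrete.
embeddings = torch.randn(N, M, D)

# Length-normalizing the embeddings is common so that similarity is purely angular.
embeddings = F.normalize(embeddings, dim=-1)   # shape stays (N, M, D)
```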
It does this using cosine similarity. Given a batch of $N$ speakers, each with $M$ utterances, GE2E loss works as follows:
- Extract embeddings to produce $e_{i,j}$, where $i$ is the speaker index and $j$ is the index of the utterance for that speaker
- Calculate a centroid for each speaker: $$ c_i = \frac{1}{M} \sum_{j=1}^{M} e_{i,j} $$
- Measure the cosine similarity between each embedding and all centroids (this step and the previous one are sketched in code after this list). Below, $s_{i,j,k}$ is the similarity of the $j$-th embedding of speaker $i$ to the centroid of speaker $k$: $$ s_{i,j,k} = \frac{\langle e_{i,j}, c_k \rangle}{\lVert e_{i,j} \rVert \, \lVert c_k \rVert} $$
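The following is a minimal sketch of the centroid and similarity computations, assuming the `(N, M, D)` embedding tensor from the earlier sketch; the function name and layout are illustrative, not taken from the original paper.

```python
import torch
import torch.nn.functional as F

def centroids_and_similarities(embeddings: torch.Tensor):
    """embeddings: (N, M, D) tensor of per-utterance speaker embeddings."""
    # c_i: mean of speaker i's M utterance embeddings -> shape (N, D)
    centroids = embeddings.mean(dim=1)

    # Cosine similarity of every embedding e_{i,j} with every centroid c_k:
    # normalize both, then take dot products. Result has shape (N, M, N),
    # with similarities[i, j, k] = s_{i,j,k}.
    e = F.normalize(embeddings, dim=-1)   # (N, M, D)
    c = F.normalize(centroids, dim=-1)    # (N, D)
    similarities = torch.einsum("ijd,kd->ijk", e, c)
    return centroids, similarities
```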
The GE2E loss itself is a softmax-based formulation over these similarities: it pushes each embedding to be most similar to its own speaker's centroid while remaining dissimilar to the centroids of all other speakers.
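Written out with the similarities defined above, one common form of the softmax variant is a per-utterance cross-entropy against the correct speaker's centroid, $$ L(e_{i,j}) = -s_{i,j,i} + \log \sum_{k=1}^{N} \exp(s_{i,j,k}) $$ with the total loss summed over all $i$ and $j$. (The original paper additionally applies a learned scale and bias to the cosine similarities and, for stability, excludes $e_{i,j}$ from its own speaker's centroid when computing $s_{i,j,i}$; those details are omitted here.) A minimal sketch of this computation, continuing from the similarity tensor produced above:

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(similarities: torch.Tensor) -> torch.Tensor:
    """similarities: (N, M, N) tensor where similarities[i, j, k] = s_{i,j,k}."""
    N, M, _ = similarities.shape
    # Treating the N centroid similarities of each embedding as logits, the
    # correct class is the embedding's own speaker index i.
    targets = torch.arange(N).repeat_interleave(M)   # (N*M,) correct speaker per row
    logits = similarities.reshape(N * M, N)          # (N*M, N)
    # cross_entropy = -s_{i,j,i} + log sum_k exp(s_{i,j,k}), summed over the batch.
    return F.cross_entropy(logits, targets, reduction="sum")
```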