Paper

  • In SignBank, there are roughly 220k parallel samples from 141 puddles covering 76 language pairs, but the distribution across language pairs is unbalanced.

  • Notably, most of the puddles are dictionaries, which we consider less valuable than sentence pairs (instances of continuous signing) for a general MT system. If dictionaries are used as training data, we expect models to memorize word mappings rather than learn to generate sentences.

    • Therefore, we treat the four sentence-pair puddles of the relatively high-resource language pairs as primary data and the remaining dictionary puddles as auxiliary data. Note that even SignBank's high-resource pairs are low-resource compared to the datasets used in mature MT systems for spoken languages, where millions of parallel sentences are commonplace.
  • en-us (American English & American Sign Language) has 43,698 pairs

  • For encoding (when FSW is the source), they embed each factor separately and concatenate the factor embeddings onto the aligned symbol's embedding. For decoding (when FSW is the target), they use only a subset of factors (absolute x and y), since the others are irrelevant for prediction.
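
As a rough illustration of this factored encoding, here is a minimal PyTorch sketch: each factor gets its own small embedding table, and the factor embeddings are concatenated onto the embedding of the aligned symbol. The factor set (x, y, fill, rotation), vocabulary sizes, and dimensions are our own illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class FactoredEmbedding(nn.Module):
    """Embed each factor separately and concatenate onto the symbol embedding.

    Hypothetical setup: factor order, vocabulary sizes, and dimensions
    are illustrative, not the paper's exact configuration.
    """

    def __init__(self, symbol_vocab, factor_vocabs, symbol_dim=448, factor_dim=16):
        super().__init__()
        self.symbol_emb = nn.Embedding(symbol_vocab, symbol_dim)
        # one small lookup table per factor, e.g. x, y, fill, rotation
        self.factor_embs = nn.ModuleList(
            [nn.Embedding(v, factor_dim) for v in factor_vocabs]
        )

    def forward(self, symbols, factors):
        # symbols: (batch, seq) symbol ids
        # factors: (batch, seq, num_factors), aligned per symbol position
        parts = [self.symbol_emb(symbols)]
        parts += [emb(factors[..., i]) for i, emb in enumerate(self.factor_embs)]
        # concatenation keeps each factor embedding aligned with its symbol
        return torch.cat(parts, dim=-1)


# e.g. factors = (x, y, fill, rotation) with made-up vocabulary sizes
emb = FactoredEmbedding(symbol_vocab=700, factor_vocabs=[1000, 1000, 6, 16])
out = emb(
    torch.zeros(2, 5, dtype=torch.long),     # (batch=2, seq=5) symbol ids
    torch.zeros(2, 5, 4, dtype=torch.long),  # four factors per symbol
)
print(out.shape)  # torch.Size([2, 5, 512]) -> 448 + 4 * 16
```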

It is not obvious how best to decode to FSW. We try the following strategies (sketched after this list):

  1. predicting everything (including positional numbers) as target tokens all in one long target sequence, inspired by Chen et al. (2022)
  2. predicting symbols only (as a comparative experiment)
  3. predicting symbols with positional numbers as target factors
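
To make the three target-side formats concrete, here is a minimal sketch that parses one FSW sign and emits each variant. The sample string, the regular expression, and the decision to drop the box header (M518x529) for brevity are all illustrative assumptions, not the paper's exact preprocessing.

```python
import re

# An illustrative FSW sign: a box header, then two symbols with x/y positions.
fsw = "M518x529S14c20481x471S27106503x489"

# Each symbol is "S" plus 5 hex characters, followed by its x/y position.
SYMBOL = re.compile(r"(S[0-9a-f]{5})(\d{3})x(\d{3})")
symbols = SYMBOL.findall(fsw)
# [('S14c20', '481', '471'), ('S27106', '503', '489')]

# 1. Everything, positional numbers included, in one long target sequence.
strategy1 = [tok for sym, x, y in symbols for tok in (sym, x, y)]
# ['S14c20', '481', '471', 'S27106', '503', '489']

# 2. Symbols only (the comparative experiment).
strategy2 = [sym for sym, _, _ in symbols]
# ['S14c20', 'S27106']

# 3. Symbols as target tokens, with x and y as per-token target factors.
strategy3 = [(sym, int(x), int(y)) for sym, x, y in symbols]
# [('S14c20', 481, 471), ('S27106', 503, 489)]
```

Note the trade-off this makes visible: strategy 1 triples the target length, strategy 2 discards positional information entirely, and strategy 3 keeps the sequence short but requires a factored decoder that predicts the x/y factors alongside each symbol.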