Preface
In early 2022, after finally finding the time to finish reading the “Attention Is All You Need” paper, I became curious about the Transformer architecture and the many promising use cases for this new kind of neural network across industries (and in life sciences in particular, because that is the field where I work). One concern I had at the time was the concrete risk that this technology could quickly become the prerogative of the large tech organizations that could afford the vast computational resources needed to train and run these models.
Then, inspired by the “run Doom on anything” challenge (a popular trend among software engineers of optimizing the source code of that 1993 video game so that it runs on any sort of device), I started thinking about ways to optimize small Transformer models, trained on domain-specific tasks and data, so that they could be deployed and executed in hardware-constrained environments. In June of the same year, I gave an in-person, hands-on workshop on this subject at the ODSC Europe conference in London, which drew a lot of interest from the ML engineers at the event.