References
Chapter 1
[1] Vaswani, Ashish, et al. (2017). Attention is all you need. arXiv. http://arxiv.org/abs/1706.03762.
Chapter 2
[1] Vaswani, Ashish, et al. (2017). Attention is all you need. arXiv. http://arxiv.org/abs/1706.03762.
Chapter 3
[1] Devlin, Jacob, et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://arxiv.org/abs/1810.04805.
[2] Liu, Yinhan, et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv. https://arxiv.org/abs/1907.11692.
[3] Dosovitskiy, Alexey, et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv. https://arxiv.org/abs/2010.11929.
[4] Warner, Benjamin, et al. (2024). Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. arXiv. https://arxiv.org/abs/2412.13663.
Chapter 4
[1] Wei, Jason, et al. (2023). Chain-of-thought prompting elicits reasoning in large language models. Version 6. arXiv. https://arxiv.org/abs/2201.11903.
[2] Chia, Yew Ken, et al. (2023). Contrastive chain-of-thought prompting. arXiv. https://arxiv.org/abs/2311.09277.
[3] Dhuliawala, Shehzaad, et al. (2023). Chain-of-verification reduces hallucination in large language models. Version 2. arXiv. https://arxiv.org/abs/2309.11495.
[4] Yao, Shunyu, et al. (2023). Tree of thoughts: Deliberate problem solving with large language models. Version 2. arXiv. https://arxiv.org/abs/2305.10601.