Specialising and analysing instruction-tuned and byte-level language models for organic reaction prediction

Pang, Jiayun (ORCID: https://orcid.org/0000-0003-0689-8440) and Vulic, Ivan (2024) Specialising and analysing instruction-tuned and byte-level language models for organic reaction prediction. Faraday Discussions. ISSN 1359-6640 (Print), 1364-5498 (Online) (doi:10.1039/D4FD00104D)

PDF (Open Access Article)
48469 PANG_Specialising_And_Analysing_Instruction-Tuned_And_Byte-Level_Language_Models_(OA)_2024.pdf - Published Version
Available under License Creative Commons Attribution.

PDF (Supplemental material)
48469 PANG_Specialising_And_Analysing_Instruction-Tuned_And_Byte-Level_Language_Models_(SUPPLEMENTARY)_2024.pdf - Supplemental Material
Available under License Creative Commons Attribution.

Abstract

Transformer-based encoder–decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: can FlanT5 and ByT5, encoder–decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study of several key aspects of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation for fine-tuning on reaction prediction, and thus become ‘chemistry domain compatible’ in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful, yet is not essential, for leveraging the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across models does exist. Notably, tokenisation and vocabulary trimming only slightly affect final performance but can speed up training and inference; greedy decoding, the most efficient strategy, is very competitive, and more sophisticated decoding algorithms yield only marginal gains. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
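
For readers unfamiliar with the terms used in the abstract, the following is a minimal illustrative sketch, not the authors' code. It shows what a byte-level view of a SMILES string looks like and how Top-k accuracy can be computed from ranked candidate predictions. The function names, the toy data, and the use of exact string matching (rather than, for example, comparison of canonicalised SMILES) are assumptions made for illustration only.

```python
from typing import List, Sequence


def smiles_to_bytes(smiles: str) -> List[int]:
    """Byte-level view of a SMILES string: one integer (0-255) per UTF-8 byte.

    Hypothetical helper: a real byte-level tokenizer would also add special
    tokens and may shift these raw byte values by an offset.
    """
    return list(smiles.encode("utf-8"))


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose target appears among the first k candidates.

    Assumption: candidates are ranked best-first and compared by exact
    string match against the target.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for candidates, target in zip(predictions, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Toy data (invented for illustration): ranked candidates per reaction.
predictions = [
    ["CCO", "CCN", "CCC"],        # target found at rank 1
    ["c1ccccc1", "C1CCCCC1"],     # target found at rank 2
    ["CC(=O)O", "CC=O"],          # target not found
]
targets = ["CCO", "C1CCCCC1", "CCN"]

top1 = top_k_accuracy(predictions, targets, k=1)
top5 = top_k_accuracy(predictions, targets, k=5)
print(smiles_to_bytes("CCO"), top1, top5)
```

With greedy decoding a model returns a single candidate per input, so only Top-1 accuracy is defined; Top-5 requires a decoding method that returns several ranked candidates, such as beam search.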

Item Type: Article
Uncontrolled Keywords: organic reaction prediction, AI, language models, finetuning, byte-level tokenisation
Subjects: Q Science > Q Science (General)
Q Science > QA Mathematics
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Faculty / School / Research Centre / Research Group: Faculty of Engineering & Science
Faculty of Engineering & Science > School of Science (SCI)
Last Modified: 17 Dec 2024 10:24
URI: http://gala.gre.ac.uk/id/eprint/48469
