Atoms as Words: A Novel Approach to Deciphering Material Properties using NLP-inspired Machine Learning on Crystallographic Information Files (CIFs)
In condensed matter physics and materials science, predicting material properties requires an understanding of intricate many-body interactions. Conventional methods such as density functional theory (DFT) and molecular dynamics (MD) rely on simplifying approximations and are computationally expensive. Recent machine learning methods, meanwhile, represent materials with handcrafted descriptors that can neglect vital crystallographic information, and they are often limited to predicting a single property or to a sub-class of crystal structures. In this study, we propose an unsupervised strategy, inspired by Natural Language Processing (NLP), that exploits the underutilized information in Crystallographic Information Files (CIFs). We treat the atoms and atomic positions within a CIF much as words are treated in text. Using a Word2Vec-inspired technique, we produce atomic embeddings that capture the relationships between atoms. Our model, CIFSemantics, trained on the extensive Materials Project dataset, predicts 15 distinct material properties directly from CIFs. Its performance rivals that of specialized models, marking a significant step forward in material property prediction.
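To make the atoms-as-words analogy concrete, the following minimal sketch shows how the atom-site rows of a CIF might be turned into a token sequence and then into the (center, context) pairs that a Word2Vec-style skip-gram model is trained on. It is not the CIFSemantics implementation: the tokenization scheme (element symbol followed by a binned fractional-coordinate token), the helper names `cif_rows_to_tokens` and `skipgram_pairs`, the bin count, and the window size are all assumptions made for illustration, and the sample rows are hand-written rather than taken from the Materials Project.

```python
from typing import List, Tuple

# Hypothetical, simplified atom-site rows in the spirit of a CIF loop:
# label, element symbol, fractional x, y, z. Values are illustrative only.
ATOM_SITE_ROWS = """\
Na1 Na 0.0 0.0 0.0
Cl1 Cl 0.5 0.5 0.5
"""


def cif_rows_to_tokens(rows: str, bins: int = 4) -> List[str]:
    """Turn atom-site rows into a flat 'sentence' of element and position tokens."""
    tokens: List[str] = []
    for line in rows.strip().splitlines():
        _label, element, x, y, z = line.split()
        tokens.append(element)
        # Assumed scheme: wrap each fractional coordinate into [0, 1) and
        # discretise it into one of `bins` equal-width cells.
        cell = [min(int(float(v) % 1.0 * bins), bins - 1) for v in (x, y, z)]
        tokens.append("pos_{}_{}_{}".format(*cell))
    return tokens


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Generate (center, context) pairs for every token within `window` positions."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = cif_rows_to_tokens(ATOM_SITE_ROWS)
pairs = skipgram_pairs(tokens, window=1)
print(tokens)
print(pairs)
```

In a full pipeline, pairs like these would be fed to a skip-gram objective so that tokens appearing in similar crystallographic contexts receive nearby embedding vectors; how the embeddings are then pooled for property prediction is outside the scope of this sketch.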