Smart Chemical Systems

Data-Driven Chemical Process Design

Speeding up chemical process development with data science.

Georgi Tancev
5 min read · Sep 25, 2023


Photo by Crystal Kwok on Unsplash

Knowledge Management

A chemical process is a manufacturing stage for the transformation (e.g., synthesis, crystallization, formulation) of chemical entities/compounds or materials. Basically, it is a recipe. Its development and design are labor-intensive, costly endeavors, since a lot of experimentation goes into finding adequate (or optimal) operating settings. Hence, getting to “good” process designs faster by reducing the amount of experimental work can save a lot of money. Expert knowledge is an extremely important asset here: with more expertise, better designs are obtained more quickly. Over the years, established chemical companies have accumulated a tremendous amount of experience and knowledge on these topics. This knowledge is usually passed on from incumbent scientists to the next generation, but with the ongoing demographic shifts, these experts are gradually retiring.

This leads to a reduction in the available workforce and a lower supply of labor, resulting in higher labor costs and a loss of valuable expertise. As a result, there is a need for higher operational efficiency and better management of knowledge. A promising strategy could be to suggest and evaluate the usefulness of process designs in a “more automated way”. More specifically, one could attempt to make predictions about process designs on the basis of the knowledge gathered in past projects. The purpose of this rather high-level post is to take that idea and expand on it.

Expert Systems

In abstract terms, the rules and heuristics applied in process engineering are latent (i.e., hidden) in the design choices made. They are driven, for instance, by molecular structures that induce macroscopic properties. To give a concrete example, molecules without functional groups, such as alkanes, will only dissolve in apolar solvents (i.e., other alkanes). Therefore, only these solvents can be used for their homogeneous reactions. This implicitly yields the rule that “like dissolves like,” and that rule would be recoverable from the associated process designs.

Taking this further, similar chemical properties and reactions should induce similar process designs. After all, such rules are also applied by experts when they solve problems: similar problems imply similar solutions. If one succeeded in adequately extracting and representing this latent knowledge from data, one could search for design proposals as a function of compounds and chemical “operations” (Fig. 1).

Fig. 1: Process-development knowledge from past projects is used to create a representation, which is then queried to obtain designs for new projects. (©Georgi Tancev)

This approach is similar to the expert systems that have previously been applied to design tasks. The difference between now and then is that knowledge no longer has to be elicited from experts first but is learned directly from data. Data-driven process design is certainly a way to facilitate the work of scientists and engineers. However, the road ahead is rocky.

Data Management

First, the question arises to what extent past work has even been documented. In many places, the relevant experimental data are probably already recorded in electronic laboratory notebooks and laboratory information management systems. If insufficient data are available, however, the situation is rather hopeless. Perhaps the data exist but are analog and scattered; in that case, one first has to make an inventory and gather the past knowledge, which typically represents the greater part of the task. Once this knowledge is accessible through databases, it can be capitalized on. Above all, it is important to be able to rely on curated, high-quality data.

Process Representations

What exactly is a process, though? The fact that processes are hierarchical, modular structures needs to be addressed; the chances of success are better if individual unit operations are focused on. Obtaining designs for some manufacturing steps is probably also easier than for others, as the complexity and the amount of available data play a crucial role. In addition, there is the question of how exactly to represent a process design. Which data must be included for a good basic design? What kind of design prototypes would make the most sense? In any case, the design proposals must deliver value. For instance, a design for a single operation may be represented as a multi-dimensional random vector with continuous and categorical variables.
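As a minimal sketch, such a mixed continuous/categorical vector for a single unit operation could look like the following (the operation and all field names are illustrative, not taken from a real project):

```python
from dataclasses import dataclass

@dataclass
class UnitOperationDesign:
    """A single unit operation (here a hypothetical crystallization step)
    represented as a mixed vector: continuous variables (floats) and
    categorical variables (strings/booleans) side by side."""
    solvent: str            # categorical, e.g. "ethanol"
    temperature_c: float    # continuous, operating temperature in °C
    stirring_rpm: float     # continuous, stirrer speed
    seeding: bool           # categorical (binary), seeded or not

design = UnitOperationDesign(
    solvent="ethanol", temperature_c=5.0, stirring_rpm=300.0, seeding=True
)
print(design.solvent)  # "ethanol"
```

A collection of such records, one per historical project and operation, would form the training data from which designs for new compounds are predicted.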

Design Prediction

To make design suggestions, the similarity of tasks must be compared. How would one compare the similarity of two problems, though? The resulting software could behave like a hash table (i.e., dictionary) in which molecular structures and additional characteristics serve as keys and designs as values. In the simplest form, the input compounds could be compared to predict an appropriate design configuration, assuming a fixed chemical transformation. Two molecules with similar structures should require more similar treatments, i.e., designs, than two less similar ones.
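In code, this dictionary analogy amounts to replacing an exact key match with a nearest-neighbor lookup under some similarity function. A toy sketch, using bit-set fingerprints and Tanimoto similarity as stand-ins for a real molecular representation (all database entries are invented for illustration):

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy "database": molecular fingerprint -> design (values are illustrative).
known_designs = {
    frozenset({1, 2, 3, 5}): {"solvent": "hexane", "temperature_c": 25.0},
    frozenset({2, 4, 6, 8}): {"solvent": "water", "temperature_c": 60.0},
}

def suggest_design(query_fp: set) -> dict:
    """Return the design of the most similar known compound."""
    best_key = max(known_designs, key=lambda fp: tanimoto(query_fp, fp))
    return known_designs[best_key]

# The query overlaps most with the first (apolar) entry.
print(suggest_design({1, 2, 3}))  # -> {"solvent": "hexane", "temperature_c": 25.0}
```

In practice, the fingerprints would come from a cheminformatics library and the lookup would return several ranked candidates rather than a single best match.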

To compare the similarity between compounds, one can use, for instance, graph kernels. A graph kernel is a function that computes an inner product between two graphs. The molecular structure alone is often not enough, however. Hence, new kernels can be created from existing ones through kernel engineering to incorporate more information about a system (e.g., words in documents, the type of chemical reaction, bulk properties, and so on). Algorithms such as support vector machines or Gaussian processes rely on such kernels, e.g., for design prediction. A modern alternative would be to store a collection of documents, such as peer-reviewed publications and internal reports, in a vector database and combine it with large language models and retrieval-augmented generation (“RAG”) to refine answers to questions such as “What method should I use for this problem?”. For instance, large language models operating on SMILES strings have been used for chemical reaction prediction.
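Kernel engineering works because sums and products of valid kernels are again valid kernels. A hedged sketch of combining a structural kernel with a kernel on a bulk property (both kernels below are simple stand-ins, not the graph kernels one would use in production):

```python
import math

def structure_kernel(a: set, b: set) -> float:
    """Toy structural kernel: Tanimoto similarity on fingerprint bit sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def property_kernel(x: float, y: float, length_scale: float = 10.0) -> float:
    """RBF (squared-exponential) kernel on a bulk property, e.g. boiling point."""
    return math.exp(-((x - y) ** 2) / (2 * length_scale ** 2))

def combined_kernel(mol_a, prop_a, mol_b, prop_b, w: float = 0.5) -> float:
    """A convex combination of valid kernels is itself a valid kernel."""
    return (w * structure_kernel(mol_a, mol_b)
            + (1 - w) * property_kernel(prop_a, prop_b))

# Two compounds: similar structures, boiling points 78 °C and 80 °C.
k = combined_kernel({1, 2, 3}, 78.0, {1, 2, 4}, 80.0)
print(round(k, 3))
```

A Gaussian process or support vector machine would then consume the resulting kernel matrix directly for design prediction.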

Impediments

While the benefits are clear, there is no guarantee that the proposed designs will be any good. On the one hand, new projects may be too different from previous ones, so that the necessary design rules may not yet be available in the training data set. This is a typical problem of data-driven solutions. On the other hand, the granularity of the designs may be inappropriate, i.e., important information may be missing or incorrect. This can happen, for example, if there is “too little signal” in the training data set. Consequently, users should not have blind faith in such a system, and the introduction of a confidence metric for predictions as a measure of trustworthiness is particularly important.
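One simple way to operationalize such a confidence metric is to abstain whenever the predictive uncertainty exceeds a threshold. A sketch, with the uncertainty assumed to come from a probabilistic model such as a Gaussian process (the threshold value is arbitrary):

```python
def gated_suggestion(design: dict, std: float, max_std: float = 0.2):
    """Return the proposed design only if the model's predictive standard
    deviation is below a trust threshold; otherwise abstain (return None)
    so the case can be routed to a human expert."""
    if std <= max_std:
        return design
    return None

# A confident prediction passes; an uncertain one is withheld.
print(gated_suggestion({"solvent": "water"}, std=0.1))  # -> the design
print(gated_suggestion({"solvent": "water"}, std=0.5))  # -> None
```

Calibrating the threshold against held-out projects would be part of validating such a system before deployment.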
