AI and machine learning (ML) are ubiquitous in the broader world culture this year, which is fairly gratifying to those who have been working in the field since long before that could be said. The subject, and its name, go back to a 1956 summer workshop held at the Dartmouth mathematics department, and the field has since experienced several cycles of boom and bust. But it has never before had the broad impact it is clearly having today. Examples need hardly be listed, but they include the massive growth in capitalization of AI-adjacent technology companies, the disruptive effects of text generation by large language models (LLMs) on human communication inside and outside of academia, and a general smartening up of all the technological systems we interact with. Naturally, scientists want to be in on such a significant increase in functionality for their own work, especially considering the numerous technical connections between the mathematical methods of AI/ML and the mathematics of the sciences.

Unfortunately, some of the major recent advances in AI/ML seem fundamentally incompatible with the scientific spirit, which is above all skeptical. Scientists ideally demand, and work to establish, some form of “proof”, founded in experiment or observation of nature, that greatly reduces any room for doubt before new knowledge can be provisionally accepted and built upon. Scientific methodology evolves to satisfy this criterion, and it builds a huge cross-referenced structure in the scientific literature that supports working scientists’ rapid inferences about what is likely to be true, and with what caveats and uncertainty (keeping in mind that some papers are just wrong), given our knowledge so far. But LLMs and other generative models excel mainly at creating new information that reflects some particular aspect of a very large training set of data, for example in response to a query. They do not yet excel at output validity, let alone at producing valid scientific results or anything like a scientific version of proof.

LLMs work on just one side of a very old, standard AI paradigm: generate-and-test. They are all “generate”, with very little validity “testing” of a candidate output. They hallucinate. And they exhibit little if any expert-level judgement. Twice this year I have had commercially deployed LLMs confidently give me numerical estimates that were off by a factor of 1000, which is completely unacceptable in scientific reasoning. One error came from failing to convert units; both involved training on unreliable sources. Any kind of automated internal critic should have caught these and many other errors, though it would probably also mute huge swaths of LLM output as unreliably known, a characteristic also of the great majority of any training data set based on what people usually communicate to each other. In fairness, a different, more recently released LLM can track the units and deliver the correct answer. However, the first LLM now doubles down on its wrong answer by also taking its reciprocal.
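
To make the point concrete, here is a minimal sketch, entirely hypothetical and not drawn from any deployed system, of what even a crude generate-and-test loop with a unit-checking critic might look like; the toy estimator stands in for an untrusted LLM, and every name in it is invented for illustration.

```python
# A minimal, hypothetical sketch of the generate-and-test pattern: an untrusted
# generator proposes (value, unit) answers and a deterministic "critic" rejects
# any candidate whose unit label or order of magnitude is off. All names are
# illustrative; no real LLM is called here.

import random

def critic(value, unit, required_unit, plausible_range):
    """Hard validity test: right unit label and within a sanity-check range."""
    low, high = plausible_range
    return unit == required_unit and low <= value <= high

def generate_and_test(generator, required_unit, plausible_range, max_tries=5):
    """Query the untrusted generator until a candidate passes the critic."""
    for _ in range(max_tries):
        value, unit = generator()
        if critic(value, unit, required_unit, plausible_range):
            return value, unit
    return None  # decline to answer rather than return an unvalidated guess

# Toy stand-in for an LLM estimator that sometimes answers in the wrong unit
# (a factor-of-1000 error waiting to happen).
def toy_estimator():
    return random.uniform(1.0, 9.0), random.choice(["m/s", "km/s"])

print(generate_and_test(toy_estimator, required_unit="m/s", plausible_range=(0.1, 20.0)))
```

Even this trivial critic would refuse to pass along an answer delivered in the wrong unit, at the cost of sometimes declining to answer at all.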

Approaches to dealing with this incompatibility are varied. Not all progress in AI/ML involves LLMs, or more broadly generative models, despite all the publicity, so one can focus on those other methods. Popular examples include deep neural networks that train on scientific data rather than natural language corpora. One can use human experts as the missing “critics”, who, for example, filter and select candidate genes from ML output and then test their predicted functionality in the lab. Or one can couple generative models with software-implemented critics based on one of the oldest methods in AI: automated logic and theorem-proving.

Within formal logic it is by definition possible to completely check every proposed proof, yielding a hard judgement of “valid” or “not valid”. “Not valid” just means the proof does not work yet, not that it cannot work with further refinement, nor that the negation of the desired theorem is valid. There are current systems that use LLMs to help suggest the missing steps in a proof that seems obvious to a human being but is not yet valid in the formal sense, in dialog with an automatic proof checker that will brook no fallacies, evasions, hidden assumptions, or fuzzy thinking. Thus, these systems use automated theorem proving and interactive theorem verification (ATP/ITV) as the missing critic for the LLM’s imaginings. LLMs are not yet dominant in this role, though. Similarly, LLMs are used to help programmers, but the programmer must take responsibility for verification beyond whatever can be put in an automated test suite or caught with existing code verification tools.
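
As an illustration of that hard judgement at work, the following toy Lean 4 fragment (my own example, not taken from any particular LLM-plus-prover system) contains one theorem the kernel certifies outright and one whose proof is left as a gap; the checker reports the gap and will not certify the result until a valid completion, such as the one suggested in the comment, is supplied.

```lean
-- A toy Lean 4 illustration (not from any particular LLM+prover system).
-- The first theorem is fully justified and accepted by the kernel; the second
-- contains a `sorry` placeholder, which the checker flags as an unproven gap.

theorem checked_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b          -- fully checked: accepted by the kernel

theorem gap_example (a b c : Nat) : a + (b + c) = (a + b) + c := by
  sorry                      -- not yet valid; a candidate completion is
                             -- `exact (Nat.add_assoc a b c).symm`
```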

Philosophers such as David Hume and Charles Sanders Peirce have long observed that scientific reasoning includes, but is much more adventurous than, logical “deductive” reasoning, admitting also empirical “induction” and “abduction”, which may or may not be understandable, as physicists Edwin T. Jaynes and Harold Jeffreys suggested, in terms of probabilistic reasoning using Bayes’ theorem. Bayes’ theorem in turn can be automated to some extent and has at long last been admitted to the canon of AI methods. So the potential role of reliable AI reasoning in science would not be fully encompassed just by making AI work well in a theorem-proving system. On the other hand, making it work in a theorem-proving system whose domain of effective action included applied mathematics, of the sort (including partial differential equations and large stochastic models) that is central in the physical and, increasingly, the biological sciences, would be quite important. That goal, however, requires ATP/ITV to “know” substantially more mathematics than it currently does. The classic AI topic of knowledge representation, as it may apply to science and science-applied mathematical work, is a rich and deep one, ready for a new generation of research.
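
For readers who want the flavor of the probabilistic side, here is a tiny sketch of the kind of Bayesian update that can be automated; the numbers are invented purely for illustration.

```python
# A toy illustration of automatable Bayesian updating (numbers are invented):
# P(H | D) = P(D | H) P(H) / [ P(D | H) P(H) + P(D | not H) P(not H) ].

def bayes_update(prior_h, likelihood_given_h, likelihood_given_not_h):
    """Return P(H | D) from a prior and the two conditional likelihoods."""
    evidence = likelihood_given_h * prior_h + likelihood_given_not_h * (1.0 - prior_h)
    return likelihood_given_h * prior_h / evidence

# Example: a weak prior for hypothesis H, but the observation is three times
# more likely under H than under its negation; the posterior rises accordingly.
posterior = bayes_update(prior_h=0.2, likelihood_given_h=0.6, likelihood_given_not_h=0.2)
print(round(posterior, 3))   # 0.429
```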

My own preferred approach to these problems might be summarized as “ML on tap, not on top”. That is, design an overall knowledge representation architecture centered on key concepts of the science, including simulation of particular interacting natural processes, but leave room for machine-learned functions and graph-like structures inside that architecture where, as in the ATP case, any damage they do to scientific validity can be limited. I have outlined the “Tchicoma” conceptual architecture for combining symbolic AI (e.g. computer algebra and theorem proving) with numerical ML and simulation, all specialized to the support of science.
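
The following is a minimal sketch of the “ML on tap, not on top” stance, not the Tchicoma architecture itself but only an illustration of the containment idea: a fast learned surrogate is consulted first, and its answer is used only if an independent validity check passes; otherwise a trusted, slower simulator takes over. All function names and the toy invariant are assumptions of the sketch.

```python
# A minimal, hypothetical sketch of "ML on tap, not on top" (not the actual
# Tchicoma architecture): a fast learned surrogate is consulted first, but its
# answer is accepted only if an independent validity check passes; otherwise
# control falls back to a trusted (slower) simulator.

import random

def ml_on_tap(state, surrogate, simulator, validity_check):
    """Use the learned surrogate when its output passes the check; else fall back."""
    proposal = surrogate(state)
    if validity_check(state, proposal):
        return proposal                  # ML result, bounded by the check
    return simulator(state)              # trusted fallback preserves overall validity

# Toy usage: the "physics" is x -> 0.5 * x, the surrogate is a noisy imitation,
# and the check enforces an invariant (here, sign preservation) that the
# surrounding architecture insists on.
simulator = lambda x: 0.5 * x
surrogate = lambda x: 0.5 * x + random.uniform(-1.0, 1.0)
validity_check = lambda x, y: (x >= 0) == (y >= 0)

print(ml_on_tap(4.0, surrogate, simulator, validity_check))
```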

Key Tchicoma concepts include the use of multiple, nested formal languages in each of four sectors: science, mathematics, computing, and a middle domain of modeling. In the central modeling domain, the formal languages could be based on the expressive “dynamical graph grammar” (DGG) models that coworkers and I have developed over the years. DGGs separate complex process models into bite-sized “rules” that are always quietly listening for their cue, but each is activated only in a very specific context, in which a hand-designed or machine-learned physical model specifies what should happen next. The mathematical meaning of such DGGs is well defined. They support dynamically changing graph structures, and we have investigated the continuum limits of such structures, which should be a powerful tool for analysis. DGGs are compatible with the essentially reductionist approach of modern science, with the understudied possibility of machine-learned mappings between models of the same process at different spatial and temporal scales. DGGs are very expressive, at least in computational biology and biophysics. So I think this is at least one vision and framework that, if developed with a tiny fraction of the intellectual energy going into LLMs, would be effective in producing reliable AI for science.
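
The schematic below is far simpler than the published DGG formalism and uses invented node labels and rates; it is meant only to convey what one such context-sensitive rule looks like in code. The rule listens for a specific local pattern in a labeled graph and, when that pattern is present, fires with some propensity and rewrites the matched subgraph.

```python
# A schematic, much-simplified sketch of a single "rule" in the spirit of a
# dynamical graph grammar (not the published DGG formalism): the rule listens
# for a specific local context in a labeled graph, and when that context is
# present it fires with a (possibly machine-learned) propensity and rewrites
# the matched subgraph.

# Graph: node labels plus an undirected edge set (all labels are invented).
graph = {
    "labels": {1: "growing_tip", 2: "cell", 3: "cell"},
    "edges": {(1, 2), (2, 3)},
}

def matches(graph, tip, cell):
    """Context test: a tip-cell edge with the right labels."""
    return (graph["labels"].get(tip) == "growing_tip"
            and graph["labels"].get(cell) == "cell"
            and ((tip, cell) in graph["edges"] or (cell, tip) in graph["edges"]))

def propensity(graph, tip, cell):
    """Firing rate; in a real model this could be hand-designed or learned."""
    return 1.0

def fire(graph, tip, cell):
    """Rewrite: insert a new cell between the tip and its neighbor."""
    new_node = max(graph["labels"]) + 1
    graph["labels"][new_node] = "cell"
    graph["edges"].discard((tip, cell)); graph["edges"].discard((cell, tip))
    graph["edges"].update({(tip, new_node), (new_node, cell)})

if matches(graph, 1, 2):
    fire(graph, 1, 2)
print(graph)
```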

By contrast, the dominant mainstream AI/ML approach seems to be to hope for the cryptic emergence of domain-relevant representations deep within many-layer neural network architectures, allowing high scores on benchmark datasets but little human understanding of the reasons for functional success. This now works so well that it hoovers up most of the available intellectual energy, leaving little for potentially more reliable and science-compatible approaches to explanation. A direction for improvement may be to impose low-bandwidth bottlenecks analogous to human communication, which works wonders in crystallizing our own thoughts into more functional, information-rich artifacts, up to and including mathematical abstractions.
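
As a purely illustrative sketch of the bottleneck idea (untrained random weights and invented dimensions), the point is only the shape of the information path: everything downstream sees a three-number code rather than the full input.

```python
# A minimal sketch of the low-bandwidth bottleneck idea: a 64-dimensional input
# is forced through a 3-dimensional code before reconstruction, so whatever the
# model "communicates" to itself must fit in a few numbers. Weights are random
# and untrained; only the shape of the information path matters here.

import numpy as np

rng = np.random.default_rng(0)
input_dim, bottleneck_dim = 64, 3

encoder = rng.standard_normal((bottleneck_dim, input_dim)) / np.sqrt(input_dim)
decoder = rng.standard_normal((input_dim, bottleneck_dim)) / np.sqrt(bottleneck_dim)

x = rng.standard_normal(input_dim)      # high-dimensional "thought"
code = np.tanh(encoder @ x)             # low-bandwidth message: only 3 numbers
reconstruction = decoder @ code         # everything downstream sees only the code

print(code.shape, reconstruction.shape)   # (3,) (64,)
```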

One place where the LLM/science incompatibility is especially acute is in the use of natural language processing systems, currently dominated by LLMs but previously the domain of the “information retrieval” community, to track and mine the vast proliferation of scientific papers. In many fields dozens of papers are published per day, a rate that is nearly impossible for a human expert to keep up with. We rely first on human editors, but increasingly on automated systems, to guess which peer-reviewed papers will seem important and fruitful enough for us even to read the abstract, let alone spend time carefully understanding the full paper. Sadly, these automated systems have a deficient “understanding” of the graph-like deep structure and logic of each individual paper, and of its potential role in extending or correcting a reliable superstructure of interrelated scientific knowledge, because as yet they neither understand highly reliable reasoning in the scientific style nor can actually do it. Of course, LLM research teams are working to increase the reasoning capability of their products.

And yet, the power of all of the aforementioned approaches to AI in general is now manifest for all to see. The puzzles outlined here are solvable. The “very applied mathematics” required looks to be fascinating. Let’s get to work!