Computation and its Connotations
A Review of Language Machines by Leif Weatherby
When I first encountered Word2Vec about ten years ago my first thought was that it was the clearest vindication of Saussurean structuralism anyone had ever produced, and that people in the humanities should be all over this. Needless to say, they did not get all over it, even as deep neural network (DNN) based approaches to natural language processing (NLP) became increasingly complex and effective, eventually giving birth to large language models (LLMs) like ChatGPT and its competitors, and even as these successes increasingly discredited the Chomskian paradigm that had booted structuralism out of linguistics and left it languishing in exile across the humanities. I’ve been waiting a decade for someone to write a book about this, and my wait is finally over. Leif Weatherby’s Language Machines sets out not just to explain why structuralists have so far failed to embrace the deep learning revolution but to correct this oversight, trying to interpret what exactly LLMs are doing and extract lessons applicable to human linguistic culture.
To rehearse the basic point, Word2Vec represents words as vectors embedded in a high-dimensional space, but the absolute position of each word vector is irrelevant. All that matters is its position relative to other word vectors. Beginning with different random weights, two networks trained on the same textual corpus will tend to converge on the same relative geometry. These relative differences encode semantic relationships that can be expressed by vector arithmetic such as ‘king - man + woman ≈ queen’. The vector spaces encoded by the transformer architectures deployed by LLMs are vastly bigger and more complicated, and their internal relations aren’t so algebraically transparent, but there is evidence that they converge in similar ways. This validates what Weatherby calls Saussure’s differential hypothesis, namely, that linguistic meaning is principally determined by relations of difference within a system of arbitrary signs. Weatherby’s gambit is then that ideas taken from the study of signs pioneered by Saussure and Peirce — semiotics — can be fruitfully applied to LLMs and their computational kin.
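For readers who have never poked at this directly, here is a minimal sketch of the famous analogy. It assumes gensim is installed and that the pretrained Google News vectors are available through its downloader; the model name and the download are assumptions about your environment, not anything from the book.

```python
# Minimal sketch: the 'king - man + woman ~ queen' analogy via the relative
# positions of word vectors. Assumes gensim can fetch the pretrained Google News
# Word2Vec model (a large download on first use).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # 'queen' typically appears at or near the top, ranked by cosine similarity
```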
So, why are structuralists so hesitant to take the win? Weatherby gives three reasons. The first is that in exiling structuralists to the humanities, the Chomskian revolution discouraged the focus on language as a concrete object that is essential to structuralist linguistics in favour of a more diffuse concern with culture. The second is that poststructuralism (and in particular Derrida’s interpretation of Saussure) deliberately encouraged this shift by attacking the closure of sign systems, opening natural language onto its outside (and in Derrida’s case swallowing it whole) while ignoring the relative independence and internal consistency of formal languages in mathematics and computer science. The third is that the humanities have become gripped by a remainder humanism that ‘establishes a moving yet allegedly bright line between human and machine.’ This is evident everywhere in the current culture war over AI, where academics marshalled into opposing big tech are willing to forget every antihumanist critique of the self-subsistent subject engendered by structuralism (see Althusser and Foucault) in order to summon some special feature of the human spirit that machines might never emulate — be it emotion, imagination, or life itself. In particular, Weatherby highlights the prevalence of a problematic dichotomy between embodiment and abstraction that is one of my own bugbears.
What does Weatherby have to offer by contrast? Quite a lot, though I might hesitate to call it comprehensive or systematic. His most important methodological contribution consists in attempting to subtract language from deeper questions to which it is supposedly inextricably bound.
On the one hand, against Chomsky and others in cognitive science and analytic philosophy of mind, he insists on bracketing the cognitive role of language, and debates about whether or not LLMs are ‘intelligent’ along with it. He argues that the generative character of LLMs foregrounds the communicative role of language, and questions about whether or not they capture ‘culture’ along with it. This isn’t to say that Weatherby has nothing to say about cognition, just that he thinks language qua sign system can and should be understood on its own terms. A fortiori, he seems to think that language’s distinctive cognitive role derives from what makes it unique qua sign system — last-instance semiology — namely, the way it is able to express everything you can express in any other sign system but not vice versa.
On the other hand, against Emily Bender and others in linguistics and analytic philosophy of language, he insists on bracketing both the referential function of language — be it understood in terms of direct word-world relations or sentential truth-conditions — and the intent involved in using it to communicate — be it understood in terms of simple psychological states or the complex conversational dynamics of Gricean implicature. He argues that the statistical underpinnings of LLMs allow them to cut across any rigid distinction between rule governed syntax and referentially grounded semantics, sifting putatively pragmatic patterns of communicative interaction from the psycho-social contingencies of fleshed out personalities in determinate situations. Instead, he appropriates Claude Shannon’s information-theoretic picture of language, but betrays Shannon’s own semantic agnosticism.
Weatherby proposes that structural relations between words are implicit in the redundancy of the linguistic signal. The paradox of Shannon’s information theory is that the most informationally dense content is a random signal. This is because it has no redundancy — each new bit of information tells you absolutely nothing about what the following bits might be. Maximal content coincides with meaningless chaos. Weatherby’s argument is that genuine communication (outside things like exchanging randomly generated encryption keys) only ever involves partial uncertainty about how the signal will continue, or its ongoing trajectory. ‘Beware of the…’ is likely to be followed by ‘dog’ or maybe ‘step’, or at the very least a sequence of words that composes a singular term. The probability distribution of continuations captures not just syntactic but semantic relationships (e.g., ‘dogs’ are the type of thing you can beware of). These same distributions are what LLMs use to generate text, by tracing more or less likely conversational trajectories. There’s thus a constitutive tension between meaning considered as the static set of relationships that the words composing a complete message bear to other words (capture), and meaning considered as the dynamic tendency of an incomplete message towards its completion (generation).
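To make the redundancy point concrete, here is a toy sketch; the three-sentence corpus and the decision to condition on the single word ‘the’ are invented for illustration. It builds a continuation distribution and compares its Shannon entropy to the maximum-entropy (fully random) case, the gap being the redundancy that makes the next word partially predictable.

```python
import math
from collections import Counter

# Toy continuation model: count what follows 'the' in a tiny invented corpus,
# then compare the entropy of that distribution to that of pure noise.
corpus = "beware of the dog . beware of the dog . beware of the step".split()
counts = Counter(nxt for prev, nxt in zip(corpus, corpus[1:]) if prev == "the")
total = sum(counts.values())
dist = {word: n / total for word, n in counts.items()}

entropy = -sum(p * math.log2(p) for p in dist.values())   # partial uncertainty about the continuation
max_entropy = math.log2(len(dist))                         # a uniform, maximally 'informative' signal
redundancy = 1 - entropy / max_entropy

print(dist)                                # {'dog': 0.666..., 'step': 0.333...}
print(entropy, max_entropy, redundancy)    # ~0.918 bits vs 1 bit: the signal is partly predictable
```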
Weatherby is at pains to emphasise the way that LLMs treat language as a surface without depth. One particularly perceptive point concerns the relation between classification and generation in neural network architecture. Classifiers are trained to do things like distinguish pictures of cats and non-cats, using a pre-labelled image set. The result is a network that takes image data as an input and produces a linguistic label as an output. The hope is that whatever layered internal representations the network has evolved to correctly label the training set will generalise to images outside this set. Generators were initially developed by turning such classifiers inside out, using their learned representations to go from label to image, rather than image to label. More specialised generators such as generative adversarial networks (GANs) and diffusion models have since been developed, but the inverse relationship remains, with one side taking non-linguistic input and producing linguistic output, and the other side taking linguistic input and producing non-linguistic output. However, though LLMs are technically generators they (mostly) take linguistic input and produce linguistic output. They never leave the linguistic surface. This goes double for reasoning models that feed their own outputs back to themselves as inputs in a ‘chain-of-thought’, running repeated cycles in search of reflective equilibrium.
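The inverse relationship is easiest to see in the input and output types involved. The sketch below is only a schematic with invented function names and deliberately empty bodies, but it captures the asymmetry being described:

```python
import numpy as np

# Schematic signatures only: classifiers go from non-linguistic input to a
# linguistic label, generators run the other way, and LLMs take text in and
# put text out, never leaving the linguistic surface.
def classify(image: np.ndarray) -> str:      # image in, label ('cat' / 'not cat') out
    ...

def generate(label: str) -> np.ndarray:      # label or caption in, image out
    ...

def language_model(prompt: str) -> str:      # text in, text out
    ...
```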
But it’s not just that LLMs are confined to the linguistic surface, but that the linguistic outputs they produce are peculiarly self-referential. What makes transformers so powerful is their attention mechanism, which enables them to simultaneously track how tokens are relevant to one another over large swathes of text, rather than just relating each token to the next. This exquisite sensitivity to the internal relations between a message’s parts applies as much to their outputs as their inputs. This is to say that they are principally focused on the internal coherence of a message over and above its correspondence to the external world. Weatherby puts this point in Roman Jakobson’s terms, claiming that for LLMs the poetic function of language is primary, while reference and other functions are secondary. Just as the poet aims first and foremost to craft a coherent set of connections between words, forging equivalences through rhythm and rhyme heedless of consequence, so does the LLM craft a coherent response to a prompt heedless of correctness. This is why LLM writing is generally better at aping holistic vibes than it is at reporting discrete facts. Weatherby suggests that, on this basis, we might be better off calling them large literary machines.
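For readers who want the mechanism rather than the metaphor, here is a minimal sketch of single-head, unmasked scaled dot-product attention in plain numpy. The dimensions and random inputs are placeholders, and real transformers add learned projections, multiple heads, positional information, and much else.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position is weighted against every other."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # pairwise relevance of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the whole sequence
    return weights @ V                                       # each output mixes the entire context

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # a hypothetical 4-token sequence of 8-dimensional embeddings
out = attention(tokens, tokens, tokens)
print(out.shape)                   # (4, 8): each position now encodes its relations to all positions
```

It is this all-against-all weighting, applied to outputs as much as inputs, that grounds the claim that coherence rather than correspondence is what these systems are built to optimise.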
This leads Weatherby to call for a generalised poetics capable of studying the generative dimension of language in all its forms — cultural, computational, and everything in between. He gives us a taste of what this might look like in the penultimate chapter, where he provides an ‘expressive’ account of ideology opposed to ‘transcendental’ accounts that stress deep, underlying structures influencing thought and action (e.g., the Marxist economic base or the Freudian unconscious). Sticking to the surface again, Weatherby proposes a generative spectrum with poetry at one end and ideology at the other, the deliberate search for unusual and evocative ways of combining words opposed to the linguistic path of least resistance — the repetitive tropes, cliches, and patterns of communication that unobtrusively frame whatever else we might say. He thinks we can use LLMs as a tool to study this linguistic background, mapping the cultural landscape in ways that bypass cognition — examining the faded foreground between signal and noise in which much AI generated content sits (e.g., music that exemplifies a genre without being remotely interesting), and dissecting the semantic packages which bundle words into associative groupings (e.g., ‘vaxx’ alongside ‘experts’, ‘sheeple’, ‘masking’ and ‘research’).
The book closes by announcing the return of rhetoric. One consequence of viewing LLMs as primarily a cultural rather than cognitive technology is that it provides a distinctive perspective on how they’ll be used to automate labour and thereby reconfigure the social and economic order. Weatherby argues that a world in which language is a service on demand will be one in which the sheer amount of communication will, if not overwhelm us, fundamentally change not just the way we produce text but the way in which we consume it. This is an old lesson from the economics of automation: when technology reduces the amount of labour required to do something, the amount of labour dedicated to that thing doesn’t necessarily go down; in fact, we often end up dedicating more time and energy to it. The relevant parallel here is email, which now consumes an appreciable fraction of the time involved in many jobs. But we are now facing the prospect of people using LLMs to turn brief summaries into long reports only for the recipient to use an LLM to compress the report back into a summary. No one can be quite sure what communicative norms will evolve in a situation where most communication is computationally mediated, but it’s clear that the power of words is changing. Be it directly designing prompts or anticipating the ways in which our words will be parsed and compressed, training in these new arts of communication will become essential.
I’m very glad that this book was written, I recommend that everyone with related interests read it, and I really hope it breeds a great deal of discussion within and without the humanities. However, that doesn’t mean I agree with everything in its pages. There are various minor things I could quibble about, such as the use and abuse of Kant — from a dubious reclassification of Chomsky as a Kantian rather than a Cartesian to an outright misreading of the schematism — or the hints toward Hegelian dialectics that don’t really coalesce beyond positing a handful of constitutive tensions. I also think that the second chapter sets up too strict an alignment between three overlapping distinctions — symbolic/subsymbolic, digital/analog, and deterministic/statistical — even if it ultimately tries to undermine the resulting dichotomy. But there are more serious issues here, and one significant difference of opinion, that are worth devoting some words to. It’s perhaps best to start with another seemingly minor misinterpretation that has bigger consequences, namely, Weatherby’s identification of Frege’s notion of Bedeutung (usually translated as ‘reference’) with meaning, rather than Sinn (usually translated as ‘sense’).
For those who don’t know, Frege famously distinguished between sense and reference in order to explain how identity statements like ‘the morning star is the evening star’ could be genuinely informative. He proposes that ‘the morning star’ and ‘the evening star’ have the same referent (Venus) but different senses, such that you can understand one but not the other. It thus makes much more sense to say that for Frege, to understand the meaning of ‘the morning star’ is to grasp its sense, rather than its reference. This is even more stark when we consider complete propositions such as ‘the morning star is the second planet from the Sun’, which Frege takes to have distinct senses but only two possible referents (the True or the False). I don’t entirely blame Weatherby for getting this wrong, as Frege also takes the sense of a proposition to be given by its truth conditions. This is genuinely responsible for the dominant conception of semantic content in analytic philosophy, of which Weatherby is so justifiably critical. But it does invite the question: are Fregeans doomed to treat reference/representation as primary in precisely the manner that LLMs make a clear case against?
In short, the answer is no. But you first have to understand the real difference between Saussureans and Fregeans: whether one treats isolated signs (e.g., ‘dog’, ‘vaxx’, ‘Hegel’) or whole propositions (e.g., ‘dogs are mammals’, ‘vaxxed sperm is inferior’, or ‘Hegel knows his dialectics’) as methodologically primary in giving an account of linguistic meaning. For Saussure (and Peirce), there are non-linguistic signs that are never involved in composing propositions, and it makes more sense to begin with a general theory of signs (semiotics) and then give a theory of linguistic meaning (semantics) as a species. For Frege, what distinguishes language from other forms of communication is its capacity to articulate truths, and only whole propositions can be true or false, so it makes sense to start there and analyse component signs in terms of their contribution to the truth or falsity of sentences. It’s worth pointing out that this also prioritises declarative sentences and the speech act of assertion more generally, though the tradition has developed fruitful analyses of questions, requests, commands, and other speech acts in propositional terms.
So the question is really whether one can prioritise the propositional without thereby prioritising reference/representation. The longer answer is yes, because you can focus on their role in inference instead of the way the world makes them true. This is the starting point of the inferentialist account of language pioneered by Wilfrid Sellars and elaborated by Robert Brandom. This approach to meaning is pragmatist, insofar as it follows Wittgenstein’s dictum that ‘meaning is use’. But it differs from Wittgenstein in claiming that the core language-game in terms of which this use must be understood is what Sellars called the ‘game of giving and asking for reasons’ (GOGAR). It is also arguably structuralist, insofar as it takes meaning to consist in the relations between words; it simply understands these relations principally in terms of the relations between the sentences they compose. There is even already a paper arguing that inferentialism is a better fit for LLMs than mainstream representationalist semantics, though it does a poor job of explaining the details of inferentialism in my view.
As a card-carrying inferentialist, I ought to be able to do a better job, so here’s a very brief summary. Inferentialism distinguishes between formal and material inferences. Formal inferences are those that are explicitly licensed by additional premises (e.g., ‘Socrates is a man, and if something is a man then it is mortal, therefore Socrates is mortal’). Material inferences are those that are implicitly licensed by the meanings of the terms involved (e.g., ‘Socrates is a man therefore Socrates is a mammal’, ‘It’s Tuesday today therefore it’s Wednesday tomorrow’, ‘LA is to the West of NYC, therefore NYC is to the East of LA’). The meaning of the relevant terms (e.g., ‘man/mammal’, ‘Tuesday/Wednesday’, ‘East/West’) is their contribution to the inferential role of the sentences they compose, which can be made explicit by the use of conditional statements (e.g., ‘If x is to the West of y, then y is to the East of x’). The goodness of these inferences is not dependent on formalisation, but being able to make them explicit allows us to criticise and thereby potentially revise them, changing the meaning of our terms (e.g., ‘If water is heated it will boil’ can become ‘If water is heated to 100C at 1 atmosphere of pressure it will boil’ or ‘If something is a particle then it is not a wave’ can be thrown out entirely).
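Here is a toy sketch of what making a material inference explicit might look like if you insisted on writing it down as code; the facts, predicates, and revised boiling rule are invented for illustration, not drawn from Brandom or from the book.

```python
# Toy illustration: material inferences ('LA is to the west of NYC, therefore NYC
# is to the east of LA') made explicit as conditional rules that can be inspected,
# criticised, and revised. All names and rules here are invented for the example.
facts = {("LA", "NYC")}                    # pairs meaning 'x is to the west of y'

def west_of(x, y):
    return (x, y) in facts

def east_of(x, y):
    # The explicit conditional: if y is to the west of x, then x is to the east of y.
    return west_of(y, x)

assert east_of("NYC", "LA")                # the material inference, now explicitly licensed

def will_boil(substance, temp_c, pressure_atm=1.0):
    # A revised rule: 'if water is heated to 100C at 1 atmosphere of pressure it will boil'.
    return substance == "water" and temp_c >= 100 and pressure_atm == 1.0

print(will_boil("water", 100), will_boil("water", 100, pressure_atm=0.5))   # True False
```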
It’s also important to point out that there’s more to a sentence’s role in GOGAR broadly construed than just inference in the narrow sense. Inference is what Sellars calls a language-language move: it takes us from sentences to sentences. But there are also language-entry moves, such as looking out the window and then saying ‘It’s raining outside’, and language-exit moves, such as hearing ‘It’s raining outside’ and picking up an umbrella before stepping out the door. Understanding sentences and the words that compose them involves having some grasp of their role in the wider ecology of perception-inference-action, but inference is the most important bit precisely insofar as it ties everything else together. A blind biologist might get away without being able to identify a tiger on sight, but they can’t get away with not understanding that if something is a tiger then it is an animal. Interestingly enough, this almost perfectly mirrors the picture of classification-reasoning-generation in neural networks we touched on earlier.
Now, you might ask, isn’t this precisely the sort of cognitive account of language of which Weatherby is trying to disabuse us? Well, yes, though I think there’s some middle ground here, and that appreciating the extent to which he fails to adequately thematise inference points to deeper problems elsewhere in his account.
I’ll start with the one explicit discussion of inference in the book, in relation to what Weatherby calls the data hypothesis. Here he explains that the standard model of what’s going on with classifiers is that training is a form of induction while application is a form of deduction — the network learns a rule from the cases (e.g., images) and results (e.g., labels) in its training set and then applies this rule to new cases to yield new results. He thinks that this is wrong, principally because the network isn’t engaging with the world directly, but only operating on data mediated by human judgement. Instead, he proposes that we see classification as abduction, or the production of hypotheses. The eponymous data hypothesis is essentially that the burgeoning ecosystem of classifiers and LLMs designed to mediate the world for us doesn’t produce its own judgements, but rather options for judgement that we can adopt should we choose. This isn’t exactly wrong, but I think it’s framed in entirely the wrong terms. His Peircean take on abduction — working backwards from rules and results to cases — isn’t the best fit for classifiers. It suggests that the label fits the image because the rule would generate that image given the label. There is an inverse relation between classifiers and generators, as we’ve already seen, but outside of autoencoders the symmetry isn’t this neat.
Weatherby wants to capture two distinct ideas by saying that classifiers produce hypotheses. On the one hand, he wants to claim that the relevant judgements are epistemically uncertain, in the same way as judgements about things no living person has actually seen (e.g., historical fact). On the other, he wants to claim that they are semantically indeterminate, because we’re the ones imposing meaning on the labels (e.g., what it means for something to be classified as a cat). But treating the judgements as hypotheses that have been inferred and asserted is a clumsy way of capturing either of these features. I also think the standard model he describes is bad, not because it misdescribes the relevant inferences, but because it treats what’s going on as inference in the first place.
It’s worth mentioning that there’s potential terminological confusion here, as many in psychology and artificial intelligence have come to use ‘inference’ in a very permissive way, such that what’s going on in my visual system when I spot a cat might get described as ‘Bayesian inference’. But from my perspective this simply isn’t a language-language move — there’s no sense in which we’re talking about inferential relations between conceptually articulated propositions expressed by declarative sentences, even though there is input data and output data that could theoretically be expressed in this form by a neuroscientist studying my brain. When I point and say ‘There, a cat!’ I’ve performed a non-inferential language-entry move. It’s an assertion, but a potentially uncertain one. Whatever warrant other people have to re-assert it and draw their own conclusions from it depends on my reliability as an observer of cats. I can even withdraw this warrant if I myself am too unsure by saying ‘It looks like there’s a cat there…’, offering up an option that people can choose to believe or not, without thereby asserting a hypothesis about anything. If anything it’s the other way around — the hypothesis that there’s a cat is the simplest explanation for why it looks like there is.
If it seems like this Sellarsian take on observational judgment treats me more or less like a cat classifier then you’re not far off. One of the examples Brandom is very fond of is chicken sexing, where people employed to separate chicks by sex can be incredibly reliable at making snap judgments but cannot reliably tell you how they do it. Studies have shown that they’re largely doing it by smell, even when they aren’t aware of it. This doesn’t really matter in most cases. Reliability is reliability. But there is an important difference between me and a classifier that’s revealed by another of Brandom’s favourite examples — talking parrots. We can train a parrot to reliably say ‘That’s red!’ whenever we show it a red object, building an ersatz biological classifier. But the parrot doesn’t understand the meaning of ‘red’ because it doesn’t understand that if something is red then it’s coloured, or that if something is red then it’s not green. It can’t get a grip on the inferential relations that make the concept semantically determinate, because it doesn’t play the game of giving and asking for reasons. This means that it’s not strictly asserting that the object is red. It’s not even saying that it looks red. We can interpret its utterances as options for moves, and its reliability might warrant us making them, but this isn’t endorsing a hypothesis. We could ourselves hypothesise that the object is red because this explains the parrot’s squawk, but that seems like logical overkill. We’re generally better off inferring that, all else being equal, when a well trained parrot utters ‘That’s red!’ when shown an object, the object is red. The same holds for non-biological classifiers.
Moving on from this explicit discussion of inference, I want to examine a part of the book in which inference is all too implicit — his account of the connection between computation and language qua sign systems exploited by LLMs — a purportedly ‘dialectical’ tension between inclusion and incompleteness. The former is the already mentioned tendency of language to ‘greedily’ absorb the contents of other systems, validating expression in the last instance. The latter is the tendency of computation to ‘reach outside’ itself by borrowing rules from other systems, aiming at truths it can’t yet prove. These systems are two sides of what Lacan calls ‘the symbolic order’, or the regime of signs which, unlike the icons that resemble what they signify (e.g., pictures signalling the presence of deer on road signs), or the indexes that are caused by what they signify (e.g., footprints in the sand signalling nearby people), are essentially arbitrary (e.g., the various unrelated words natural languages have for water, or the divergent conventions for variable naming in programming languages).
Yet the deeper connection that makes LLMs possible emerges from the way this arbitrariness is inflected in either case — by the pure self-referentiality of the poetic function on the one hand and the explosive fecundity of numerical encoding on the other — each opening up its respective system onto an ongoing process of elaboration and discovery. Language displays an internal openness to experimental recombination (exemplified by Weaver’s ‘Constantinople fishing nasty pink’), while computation exhibits an external openness to axiomatic extension (augured by Gödel’s self-referential sentence G). This makes each inherently generative in its own way. Weatherby posits that this generative openness is a shared feature of their form — the relative autonomy of their syntax and semantics — at once proven and exploited by the transformer architecture, whose attention mechanism translates poetic structure into numerical code. There’s an undeniable neatness to this picture, but its pleasant symmetry conceals a dearth of concrete examples and some awkward ambiguities.
By far the most awkward question we might ask is ‘What exactly do you mean by “computation”?’ As you might have sensed from the last paragraph, there’s a persistent ambiguity between computation and mathematics throughout the book, with one standing in for the other without a consistent account of the difference (and thereby the relation) between them. For all that Weatherby criticises the poststructuralist failure to take formal languages seriously, his talk of computation is torn between two registers: one treats it abstractly by identifying it with mathematics (in its Lacanian guise as the opposing pole of the symbolic order), the other treats it concretely by positioning it as mathematics mediated by language — either in the form of software (following Kittler) or in the form of neural network classification/generation (as discussed above). Unfortunately, this leaves little room for the mathematics of computation — general recursive functions, Turing machines, lambda calculus, etc. — the formalisms which delimit the classical conception of computability (not to mention others which may herald its extension).
Let me sketch why this is significant. The equivalence between the mathematical frameworks just mentioned consists in identifying what can be computed, namely, a class of (partial) functions from finite tuples of natural numbers to natural numbers. But each framework provides a perspective on how they are computed: by abstract composition of primitive functions (general recursion), by abstract machines undergoing state transitions (Turing machines), or by abstract expressions undergoing evaluation through rewriting (lambda calculus). You might think of this as the difference between a process and its purpose, which pull apart precisely insofar as we’re capable of constructing processes (i.e., machines/expressions) that don’t serve well-defined purposes (e.g., non-terminating Turing machines or untyped lambda terms with no normal form). You might think of this more concretely in terms of the way obscure hardware breeds weird machines with wildly unintended behaviour, or the way sloppy software deploys leaky abstractions which disguise divergences between specification and implementation. These distinctions then let us ask precisely how language might mediate mathematics in computation.
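To make the split between process and purpose concrete before turning to that question, here is a minimal sketch of a process with no well-defined purpose: the untyped lambda term (λx. x x)(λx. x x), which has no normal form. The Python rendering is an approximation, since eager evaluation turns endless reduction into unbounded recursion, which the sketch catches so that it terminates when run.

```python
import sys

# Omega: applying the self-application function to itself. In the lambda calculus
# this term reduces forever and never yields a value; it is a process that computes
# no function. Python's eager evaluation turns the endless reduction into unbounded
# recursion, which we catch so the sketch halts.
omega = lambda x: x(x)

sys.setrecursionlimit(5_000)
try:
    omega(omega)
except RecursionError:
    print("no normal form: the process runs without ever returning a value")
```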
Let’s start with software. This is obviously dominated by the linguistic perspective on computational processes encouraged by lambda calculus (most notably in functional programming). However, this isn’t just because the practice of programming uses artificial languages to implement processes, but because the broader discipline of software engineering deploys natural language to specify purposes. For example, a concrete system such as a retail payment platform is not conceived first and foremost as computing a function over natural numbers, but as facilitating a set of relationships between businesses, customers, and products. These relationships can be incorporated into the program qua evolving linguistic object by borrowing from natural language (e.g., ‘product > beverage > latte’ within an OOP class hierarchy), even if this program is ultimately compiled down into an abstract machine computing specific mathematical functions over binary input/output. Software involves a hierarchy of linguistic abstractions that mediate between human intention and machine execution in a more or less continuous fashion.
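Here is a minimal sketch of the kind of borrowing the ‘product > beverage > latte’ example points at, with an invented class hierarchy standing in for a real retail platform; every name and price is hypothetical.

```python
# Invented example: natural-language categories imported into a program as a class
# hierarchy, encoding business relationships rather than functions over numbers.
class Product:
    def __init__(self, name: str, price_pence: int):
        self.name, self.price_pence = name, price_pence

class Beverage(Product):
    def __init__(self, name: str, price_pence: int, volume_ml: int):
        super().__init__(name, price_pence)
        self.volume_ml = volume_ml

class Latte(Beverage):
    def __init__(self, shots: int = 1):
        super().__init__("latte", price_pence=350, volume_ml=240)
        self.shots = shots

order = [Latte(), Latte(shots=2)]
total_pence = sum(item.price_pence for item in order)  # the customer-product relationship the code encodes
print(total_pence)                                     # 700: ultimately just arithmetic over integers
```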
What about neural networks? Though these are built using software tools, and can be treated as mathematical functions between their input and output layers, they are more naturally viewed from the machine perspective on computational processes encouraged by Turing. This is because the function they realise evolves over the course of training while the essentially non-linguistic network architecture that facilitates this evolution persists. What about classifiers and generators? These are obviously mediated by language in some fashion, but it appears to be very different from what’s going on with software. Even when software specifications are imprecise, they are generally still explicit. But the problem a neural network is supposed to solve is largely implicit in the training data. This is the difference between trying to articulate a rule and trying to learn a norm. When these are linguistic norms implicit in labelled training data we might say that what is being computed is mediated by language just insofar as the purpose of the computation is at least partially linguistic (i.e., language-entry/language-exit). But this is discontinuous with how the computation is implemented.
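The contrast with software can be put in a few lines. In the toy sketch below (a single logistic unit with made-up data), the architecture, that is, the shape of the computation, stays fixed from start to finish, while the function it realises drifts as the weights are fitted to a norm implicit in the labels.

```python
import numpy as np

# Toy stand-in for training: a fixed architecture (2 inputs -> 1 sigmoid output)
# whose realised input-output function changes as its weights are fitted to
# labelled examples. Data and task are invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # the 'norm' implicit in the labels

w, b = np.zeros(2), 0.0                      # the persistent architecture's parameters
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # the function currently realised
    w -= 0.1 * (X.T @ (p - y)) / len(y)      # gradient steps change the function,
    b -= 0.1 * float(np.mean(p - y))         # not the architecture

print(w, b)   # same shape of computation as at initialisation, different function
```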
What about LLMs and reasoning models? Well, this analysis starts to break down when we ascend from classifiers and generators to such general purpose systems, precisely because they don’t have a well-defined range of purposes. We may want to say that their purpose is to use language as such. But we might equally want to say that the various specific tasks they can perform — such as summarising information or assessing the coherence of an idea — use language as a means to other ends. Here language mediates every computational process, but in a way that’s anathema to the software model, rendering how anything is done largely opaque. We’ve captured linguistic abstractions with mathematical structure (what Weatherby calls the poetic heat map), and can even use novel techniques to analyse this structure, but not in any way that would be recognisable to those studying the semantics of programming languages. So even if language mediates most software and many neural nets, this doesn’t mean that language is a unitary and universal medium of computation.
It’s tempting to identify computation with mathematics if only to recover its underlying unity. This is effectively what Weatherby does in order to treat Gödel’s incompleteness theorems as the defining feature of computation. However, his treatment of these theorems leaves something to be desired. The issue is that he identifies computation and mathematics by talking about both as formal systems, and then, in line with his analysis of literary form, reads incompleteness as an excess of semantics over syntax. But formal systems are strictly speaking deductive systems, and the excess is strictly one of consequence (what propositions are implied by the axioms) over derivability (what propositions can be proven from the axioms using the available rules). Put bluntly, this is a question of (formal) inference. The syntax in question is not simply the structure of sentential units considered in isolation, but their concatenation into valid proofs. This invites questions about the corresponding computational objects. Weatherby talks of ‘rules’ that must be borrowed from outside the computational system if it is to capture new ‘contents’, but this is far too vague.
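For readers who want the distinction stated rather than gestured at, here it is in standard notation, glossing ‘consequence’ as truth in the intended model of arithmetic, which is how the excess is usually cashed out; the notation is mine, not the book’s.

```latex
% Derivability (syntax) versus truth in the intended model (semantics):
%   T \vdash \varphi           : \varphi can be derived from T's axioms using the system's rules.
%   \mathbb{N} \models \varphi : \varphi is true of the natural numbers.
% Gödel's first incompleteness theorem: for any consistent, effectively axiomatised
% theory T extending basic arithmetic, there is a sentence G_T that T neither proves
% nor refutes, yet which is true in the intended model:
\[
  T \nvdash G_T , \qquad T \nvdash \neg G_T , \qquad \mathbb{N} \models G_T .
\]
% What is true outruns what is derivable, however the rules are chosen.
```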
Weatherby is right to note that the histories of logic and computation are intertwined. Turing’s proof that the halting problem is undecidable, and his consequent negative resolution of the Entscheidungsproblem, flow directly from Gödel’s results and frame the foundational questions of the theory of computation: what can we predict about the behaviour of programs and which problems can be solved by them? He’s also right that ‘symbolic’ or so-called ‘good old-fashioned AI’ (GOFAI) was strictly formalist in its attempt to model cognition on deductive systems. However, the actual connection between logic and computation is far trickier than this might suggest. Yes, there’s a sense in which everything is ultimately running on circuits governed by Boolean operations. Yes, there are custom logics we can use to reason about program execution. Yes, there’s such a thing as logic programming, which exploits a limited correspondence between proof search and program execution. Yes, there’s even a more general correspondence between proofs in intuitionistic logic and programs in typed lambda calculus (proof normalisation ~ program evaluation), which has been extended to encompass more logics and more aspects of computation, and which also forms the basis of a new approach to mathematical foundations. But none of these connections amounts to a precise equivalence.
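To give the flavour of the correspondence in its simplest case, here is a one-line Lean example (mine, not the book’s): the proof of A → (B → A) just is the program that takes an a and a b and hands the a back.

```lean
-- Curry-Howard in miniature: the proof term for A → (B → A) is the K combinator,
-- a program that takes a proof of A, ignores a proof of B, and returns the proof of A.
example (A B : Prop) : A → (B → A) :=
  fun a _ => a
```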
Crucially, even if there is some sense in which every proof is a program, this doesn’t mean that every program is a proof. We occasionally stumble on programs that prove nothing when we make mistakes (e.g., typing errors, infinite loops, etc.), producing malformed programs where implementation and specification diverge in catastrophic ways. But there’s a vast space of possible computations, and we might best see logic as giving us tools to wall off the well-behaved processes which serve specified purposes from unruly processes which exist without purpose. And yet some of what lies beyond this wall is useful, even desirable, despite its relative unpredictability. This is most obvious in the case of non-terminating, non-deterministic, interactive, and concurrent computations, some of which are simple enough to render reasonably well-behaved and all of which turn up in real-world use cases. It’s most extreme in the case of self-modifying programs, which can evolve in highly unexpected ways. Now that we’re trying to grow programs by training neural networks we’re wandering headlong into this wider space, but we’ve little choice if we want to build computational systems without well-defined purposes, especially ones capable of expanding their capacities or discovering novel truths.
So, what do Gödel’s theorems tell us about computation? They tell us that we can’t automate mathematical discovery in the obvious way. We can’t simply give a program a finite set of axioms and inference rules and get it to mechanically enumerate all the theorems, because there will be theorems entailed by the axioms that can’t be proved using the rules, no matter how long it might run for. We’ll always need to add more axioms to prove these truths, which will in turn entail further truths they can’t prove. I think the better way to frame this is that mathematics lets us ask questions that can’t be answered simply using the concepts with which we pose them. For example, you can pose Fermat’s Last Theorem with nothing more than a grasp of integers, addition, and exponents, but Wiles’s proof requires you to also understand elliptic curves and modular forms. But once you understand these new concepts you can ask a whole new range of questions you might need further concepts to answer.
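Here is a toy picture of what ‘mechanically enumerate all the theorems’ means, with an invented three-axiom system and modus ponens as the only rule; Gödel’s point is that for any such procedure over arithmetic there are truths that will never show up in the output, however long it runs.

```python
# Toy theorem enumerator: close a finite axiom set under a single rule (modus
# ponens over strings of the form 'P->Q'). Anything not reachable by finitely
# many rule applications is never printed, no matter how long the loop runs.
axioms = {"A", "A->B", "B->C"}

def modus_ponens(theorems):
    derived = set()
    for imp in theorems:
        if "->" in imp:
            antecedent, consequent = imp.split("->", 1)
            if antecedent in theorems:
                derived.add(consequent)
    return derived

theorems = set(axioms)
while True:
    new = modus_ponens(theorems) - theorems
    if not new:
        break                   # closure reached for this toy system
    theorems |= new

print(sorted(theorems))         # ['A', 'A->B', 'B', 'B->C', 'C'] and nothing else, ever
```

What the toy version cannot show is the interesting part: the new concepts, like elliptic curves and modular forms, that let us prove what the old rules could not.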
Does this mean that computation has to ‘reach outside’ itself to acquire these concepts? Well, yes and no. If by ‘computation’ you mean the sort of process that corresponds to a proof, then the best you could hope for is to feed it more axioms and run it again. This is not quite the same as a process that looks outside itself for inspiration. If there’s such a thing as a ‘computational’ process of mathematical discovery — which, given that there’s no good reason to believe we aren’t computational systems, we strongly suspect there is — then its search for inspiration would be characterised by precisely those features — non-termination, non-determinism, interaction, concurrency, self-modification — that render such processes logically ill-formed. This isn’t to say such systems are essentially illogical, just that they can’t be purely deductive. There is ultimately no procedure that can brute force mathematics.
Yet almost everything we do with computers has nothing to do with mathematical discovery. It’s more a matter of mathematical calculation. Furthermore, this is rarely calculation for its own sake (e.g., calculating the nth digit of pi), but most often the application of mathematical models to non-mathematical problems (e.g., calculating rocket trajectories). There are comparable processes of computational discovery involved in finding (and revising) these mathematical models, which have to ‘reach outside’ themselves almost by definition, and which may even be related to the process whereby we uncover new mathematical tools, even though they are distinct from it. What else are we doing when training neural nets than trying to discover a suitable (black box) model of that which is classified and generated? But Gödel’s results do not directly apply to this type of discovery. We might want to claim by analogy that there is ultimately no procedure that can brute force reality (and I do), but this is a complex problem in its own right for which the incompleteness theorems serve mostly as a suggestive symbol.
Where does this leave ‘rules’ and ‘contents’? Weatherby principally describes computational incompleteness as openness to new rules and linguistic inclusion as openness to new contents, with the two intertwined in LLMs. I might prefer to say that neural nets are learning ‘implicit norms’ rather than ‘explicit rules’, but there’s not a huge amount of difference here. Yet I think that articulating the connection between rules and contents means unravelling the pretence that we aren’t talking about cognition here, precisely insofar as the capacity to express new contents is tied to the capacity to know new truths, while computational rules breed inferential ones. This much should be obvious in the case of mathematical discovery, as it aims to expand our store of true theorems and proof techniques. But it’s less obvious in the non-mathematical case, where one might still pretend that the accumulation of new calculative methods has no such telos.
Imagine a classifier that learns to roughly estimate the weight of objects in videos on the basis of various factors, including how easily they seem to be moved about. Now imagine a generator that can produce videos of moving objects and can be told to make them lighter or heavier, resulting in them being moved about more or less easily. Both have learned an internal model of basic physics, though their implementations might differ quite significantly. Now consider human beings whose neurocomputational models enable them to observe, estimate, and manipulate objects based on weight, often in continuous loops of perception and action. Again, the neurocomputational details may differ between individuals. To make a call back to one of my minor criticisms, what we are dealing with here are what Kant would call (empirical) schemata — rules for identifying, extrapolating, and manipulating patterns in sensory data.
What distinguishes concepts from schemata is that — qua rules for inference — they abstract structural relationships from computational implementation. When different humans learn to talk to one another about weight, and when classifiers, generators, and LLMs are plugged into these conversations, we can be assured that we’re talking about the same thing insofar as the material inferences implicit in our linguistic behaviour are consistent with ‘if a is heavier than b, then, all else being equal, b is easier to move than a’. The same concept can be implemented by differing schemata. As I’ve explained in more detail elsewhere, the true power of language lies in refining and revising these abstracted relations in ways that reframe the parochial understanding encoded in our neurocomputational models, as happens when we move from concepts of weight and effort to those of mass and force by codifying Newton’s second law. This enables us to technologically extend our capacities for observation and manipulation and cognitively ascend from talk about the weight of objects we can see and touch to talk about the mass of electrons and black holes.
This is to my mind what makes language inclusive in Weatherby’s sense — not its symbolic arbitrariness, but its conceptual extensibility. The capacity to express any content derives from the capacity to express any proposition, and this is enabled by the unlimited reach of inference — the possibility that the truth of any proposition may be relevant to any other, from unappreciated symmetries between falling apples and orbiting stars to unexpected connections between CO2 emissions and global weather patterns. Both of us see this as a capacity of the linguistic system as a whole, but I think it’s only fully realised by language qua medium of cognition — the inferential architecture mediating our ever-expanding capacities for perception and action, progressively abstracting from their computational underpinnings. If there’s a corresponding computational incompleteness in Weatherby’s sense — an excess of semantics over syntax — it lies in the irreducibility of material inference to formal inference. Even the rarefied abstractions of mathematical physics ultimately require cognitive/computational contact with the world, not just to remain grounded but to drive progressive revision. We cannot operate with solely stipulative definitions any more than we can brute force discovery.
This hardly seems like the middle ground between Weatherby’s semiotic non-cognitivism and my semantic inferentialism that I initially promised, so allow me to backtrack by way of conclusion. I still think that Weatherby is correct to analyse LLMs as learning language as a structural sign system independently of its cognitive role. But I think this isn’t a matter of capturing language as essentially non-cognitive but as pre- or proto-cognitive. Even if inference is the core of meaning, there’s more to meaning than inference, not just in the sense of language-entry/exit, but in the sense of language-language relations that aren’t strictly inferential. In Weatherby’s terms we might say that there’s also an excess of semiotics over semantics, and that this reveals how cognition is embedded in culture.
To rework an idea from the semiotic tradition, we might distinguish between implication and connotation. Just because there’s no significant difference between the inferential role of ‘that’s strange’ and ‘that’s queer’ used literally in most contexts doesn’t mean that ‘strange’ has the same meaning as ‘queer’. The latter isn’t simply a homonym of the term for non-normative sexuality, but bears a range of complex associations that subtly warp the way discourse continues, signalling social affiliations or summoning topics to mind. Such connotations can sometimes suggest novel inferential options (e.g., ‘if x is queer then its difference from the norm is socially significant’), even if they don’t compel them, and even if they aren’t intended by the speaker. Moreover, they aren’t felt in isolation, but are affected and intensified by wider patterns of word usage — the poetic relations internal to messages that Weatherby centres. These proto-inferential patterns of association offer generative inspiration to processes of cognitive discovery, should they be open to it.
The thing about LLMs qua language machines is that they don’t draw a distinction between implication and connotation. They capture one along with the other, in the same (neural) net. This is why they’ve generally been better at producing stylistic coherence than logical consistency. They simply can’t prioritise the sorts of internal relationships upon which the latter depends. We might speculate that what makes so called ‘reasoning models’ different from their more literary kin is that the chain-of-thought process helps to filter implication from connotation, and that reinforcement learning focused either on verifiable outputs or intermediary steps further intensifies this, emphasising these differences within the model itself. This isn’t to say that such models are learning distinct inference rules and sequentially applying them in precisely the way in which humans or automated proof assistants would, just that they’re shaping the generative potential of language along a recognisable trajectory from pre-cognitive to cognitive. And it’s ultimately this trajectory that Weatherby’s plea for structuralism makes visible.



I’m glad to see you’re on Substack.
This is a very strong piece that far exceeds its assignment as a review (indeed, you have sketched out the beginnings of a general theoretical foundation for LLMs). The point about newer reasoning models in particular treads carefully between knee-jerk deflation and recognition of novelty. While it essentially involves passing forward acquired context or reprompting, what seems more interesting to me is the path dependency of this process of context acquisition, which, as you note, constrains the space of possible responses going forward. It also makes it possible to discover and isolate subsequences where reasoning goes wrong. There are a couple of points with which I take umbrage:
- Human cognition is intellectus ectypus in Kantian terms, meaning that it is irrevocably constricted to cognition through representation (i.e., in the concept in the form of judgement). Empirical schemata are products of the rather fuzzily-defined faculty of the imagination. They are temporal mediations between intuitions and empirically acquired concepts. The model here makes it sound like you are directly equating learned representations—weights—with these rules for the dynamic application of the concept to intuition in time. However, the schema is a rule that applies an already-possessed concept (whether a priori or empirically acquired). LLMs approximate the generalised abstraction of the concept via brute force: scaling up parameters/empirical instances. Surely these weights are more like the mechanisms of association by which empirical concepts are first formed from repeated intuitions? Granted, Kant is less clear on this phase than on the theory of judgement, but LLMs are still operating at the level of the empirical laws of association. To suggest otherwise would be to argue that learned representations are the same thing as concepts. At least for now, they do not have that scope, that ability to generalise: the reliance on the massive scale of empirical particulars in the training set and the vast disparity in energy expenditure for the performance of equivalent linguistic tasks secure this. This seems to undermine your other point about their activity being pre-cognitive/sub-representational. I would suggest either way that because LLMs are parasitic on the substrate of human language reproduction, they are dealing with representations-of-representations (i.e., they are doubly ectypal). This seems better than attempting to map pieces of the theory of judgement to computational processes and elements in LLM architecture. Much of the philosophy attempting the latter has been radically unsuccessful. See below on the territorial transcendental/empirical confusion.
- The humanism charge, which is a bit of a canard and quite hackneyed in academic circles at this point—almost an "epistemological aesthetic"—does not land. LLMs are operating on the products of human cognitive labour. They are not being misrecognised for human output; they are recombining human output. The various attempts at philosophies of the inhuman in recent years, which remain yoked to the human by defining themselves against it, are not especially helpful in this specific instance.
- To preface this, I am not quite sure from this piece whether this is your position or Weatherby's. However, the understanding of classification here seems not a little dated, confined to hand-labelled training data (supervised learning). Self-supervised learning has been the dominant mode since BERT/GPT. There are also foundation models trained on unlabelled corpora where the question of labelling simply becomes "predict the next token". There are forms of reinforcement learning involving preference comparisons rather than conceptual classification. The classifier as human-judgement-mediated hypothesis generator has been obviated by this variety of approaches in general and self-supervision in particular. It still holds out somewhat as far as training set curation goes, but barely, and certainly does not legitimise a purely linguistic approach.
- The invocation of the Curry-Howard-Lambek correspondence seems like a bit of sleight of hand to smuggle empirical content into the a priori, which strikes me as a general problem with the neorationalist project. Computation is not substrate-independent. Otherwise, why would we have to restrict the mathematics of computer science to sequential time, effective procedures, finitary methods, or discrete state spaces? Why not pure mathematics? The very reason we use intuitionistic/constructivist logic is due to the time-bound nature of computational processes. There is a kind of notational fetishism or idealism regarding symbolic cognition here: successful models of computation are not in themselves computation. Turing machines, lambda calculus, general recursive functions, operational semantics, process calculi: these are ways of modelling computation; they are not constitutive of computation itself. CHL connects formal systems but does not make the passage to concrete computation and therefore inhibits analysis of the effects of computation in the real. The transcendental is not computational because computation is already an empirical specification, i.e., a constraint on what physically realisable systems can do in time, not a constraint on the conditions of possibility for experience as such. This move from "mental construction requires time" (Brouwer) to "all temporal processes are computational" substitutes a class of mathematical models for the structure of synthesis itself. This seems like straightforward dogmatism: tracing the transcendental in the image of the empirical.