Alibaba’s latest model, QwQ-32B-Preview, has gained some impressive reviews for its reasoning abilities. Like OpenAI’s GPT-4 o1,¹ its training has emphasized reasoning rather than just reproducing language. That seemed like something worth testing out (or at least playing around with), so when I heard that it had quickly become available in Ollama and wasn’t too large to run on a moderately well-equipped laptop, I downloaded QwQ and tried it out. I also tried two competing models: GPT-4 o1 and Gemma-2-27B. GPT-4 o1 was the first model to claim that it had been trained specifically for reasoning. Gemma, as far as I know, makes no such claim.
How do you test a reasoning model? It’s not as easy as just checking facts. I thought, “Maybe I could ask it to prove Fermat’s Last Theorem,” but I quickly realized that the only response I’d be able to evaluate was “I can’t prove this.” (Which was, in fact, the response I received from all three models.) I can’t prove Fermat’s Last Theorem either, nor do I claim to understand any of the massive proof that mathematicians have accepted. If QwQ sputtered out a few thousand tokens of “proof,” I could fairly confidently say that its reasoning was flawed—just because I know the actual proof is much longer, a hundred pages of dense mathematical reasoning. But that’s hardly a valid test.
What about Fermat’s Little Theorem? That’s a lot easier, but it’s a textbook example, so whatever the result, I’d never know whether QwQ reasoned correctly or if it just managed to parrot something from its training set. There are more than a few math textbooks online, and it’s fair to assume that all of them are in the training data.
So I tried a slightly more obscure example: Prove that if a number consists only of ones (like 11,111,111) and is prime, the number of digits (all 1s) must also be prime. I’ve managed to prove this, but it’s not a problem that’s famous enough to attract textbook attention. So let’s go! Here’s QwQ’s proof:
>>> Can you prove that if all the digits of a prime number are 1,
then the number of digits must be prime?
Let’s tackle this problem step by step. So, the statement is: If all the digits of a prime number are 1, then the number of digits must be prime. Hmm, interesting. Let’s break this down.
First, let’s understand what we’re dealing with. We have a prime number where all its digits are 1. For example, numbers like 11, 111, 1111, etc., but only considering those that are prime. We need to prove that the number of digits in such a prime number must itself be a prime number.
Let me think about this. Let’s denote the number of digits by \( n \). So, the number is composed of \( n \) ones. Such numbers are known as repunit numbers. A repunit is a number consisting solely of ones in its decimal representation. For example, \( R_2 = 11 \), \( R_3 = 111 \), \( R_4 = 1111 \), and so on, where \( R_n \) denotes a repunit of length \( n \).
Our goal is to show that if \( R_n \) is prime, then \( n \) must be prime.
Let’s consider the contrapositive of this statement. The contrapositive of “If \( R_n \) is prime, then \( n \) is prime” is “If \( n \) is not prime, then \( R_n \) is not prime.” If we can prove the contrapositive, that will establish the original statement.
…and so on, until…
**Final Answer**
\[ \boxed{\text{If } R_n \text{ is prime, then } n \text{ is prime.}} \]
I’ve spared you the (very) chatty proof; if you want to see it, put QwQ to the test yourself. Suffice it to say that it was correct. Both GPT-4 o1 and Gemma-2-27B gave me the same proof but with less chatter. Gemma skipped some steps that a human might consider obvious.²
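For readers who want the gist without the chatter, the idea behind the contrapositive can be sketched in a few lines of Python. This is an illustration of the standard argument, not a transcript of any model's output, and the names `repunit` and `repunit_divisor` are made up for this example. The key observation: if n = a × b with 1 < a < n, then the n-digit repunit is the a-digit repunit written b times in a row, so the a-digit repunit divides it.

```python
from typing import Optional


def repunit(n: int) -> int:
    """Return the base-10 number written as n ones, e.g. repunit(3) == 111."""
    return int("1" * n)


def repunit_divisor(n: int) -> Optional[int]:
    """Return a nontrivial divisor of repunit(n) when n is composite, else None.

    If n = a * b with 1 < a < n, then repunit(n) is repunit(a) repeated
    b times, so repunit(a) divides repunit(n).
    """
    for a in range(2, n):
        if n % a == 0:
            return repunit(a)
    return None  # n has no proper divisor, so this construction finds nothing


if __name__ == "__main__":
    for n in range(2, 13):
        d = repunit_divisor(n)
        if d is None:
            print(f"n={n}: no divisor from this construction")
        else:
            # Sanity check: the claimed divisor really divides the repunit.
            assert repunit(n) % d == 0
            print(f"n={n}: R_n = {d} x {repunit(n) // d}")
```

Note that this only establishes one direction. The converse is false: 3 is prime, but 111 = 3 × 37 is not.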
Next, can we go a step farther and generalize?
>>> does this proof hold for
bases other than base 10?
That’s an interesting question! Let’s explore whether this proof can be extended to bases other than base 10.
In base \( b \), a repunit of length \( n \) is defined as:
\[ R_n^{(b)} = \frac{b^n - 1}{b - 1} \]
…and so on…
Again, try it yourself—and again, QwQ proved it correctly. This second proposition isn’t as likely to show up in textbooks or other resources; it’s less of a proof than an observation that the original proof didn’t make any assumptions about the base.
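The observation that the argument never depends on the base can be spot-checked numerically. The sketch below is a sanity check over a small range, not a proof; the helper names and the ranges (bases 2 through 16, lengths up to 40) are arbitrary choices for illustration. It uses the closed form (b^n − 1)/(b − 1) quoted above.

```python
from typing import Optional


def repunit(n: int, base: int = 10) -> int:
    """Value of n ones written in the given base: (base**n - 1) // (base - 1)."""
    return (base**n - 1) // (base - 1)


def smallest_proper_divisor(n: int) -> Optional[int]:
    """Return the smallest a with 1 < a < n that divides n, or None if n has none."""
    for a in range(2, n):
        if n % a == 0:
            return a
    return None


def check_bases(max_base: int = 16, max_len: int = 40) -> bool:
    """Check that a composite length always yields a repunit divisor, in every base tried."""
    for base in range(2, max_base + 1):
        for n in range(4, max_len + 1):
            a = smallest_proper_divisor(n)
            if a is not None and repunit(n, base) % repunit(a, base) != 0:
                return False
    return True


if __name__ == "__main__":
    print(check_bases())
```

In base 2, repunits are the Mersenne numbers 2^n − 1, so this is the same reasoning that shows a Mersenne prime must have a prime exponent.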
When I asked GPT to prove the same theorem, I got a very similar (and correct) proof, stated more formally and with less color commentary. That isn’t particularly surprising, since GPT has also been trained to reason. I was more surprised to see that Gemma-2-27B also gave me a correct proof. Gemma has been trained on mathematical texts but not specifically on “reasoning.” (Perhaps Google’s marketing never thought to call this training “reasoning.”) Gemma omitted some of the steps—steps a regular human would probably omit as obvious but that a mathematician would write out for completeness. (Just to make sure, I asked GPT to confirm that Gemma’s proof was correct. It agreed.)
Have we proven that training models to reason “works”? Well, we can’t claim to have proven anything on the basis of one successful trial or, for that matter, on the basis of an extremely large number of trials. (In case you’re wondering, Gemma-2-7B, an even smaller model, failed.) But we have learned something very important. Think about the size of the models: OpenAI has said nothing about the size of GPT-4 o1, but it is rumored to have over a trillion parameters. QwQ weighs in at 32 billion parameters, and Gemma-2-27B at 27 billion. So QwQ and Gemma-2 are between one and two orders of magnitude smaller than GPT.
Furthermore, GPT runs on what must be considered one of the world’s largest supercomputers. We don’t know the size, but we do know that OpenAI’s infrastructure is massive and includes a large percentage of the world’s high-end GPUs. QwQ and Gemma ran happily on my MacBook Pro. They made the fan spin and sucked down the battery, but it was nothing extraordinary. Granted, GPT is serving thousands of users simultaneously, so it isn’t really a fair comparison. But it’s important to realize that GPT isn’t the only game in town and that models running locally can equal GPT on nontrivial tasks. Most people who have experimented with running models locally have come to similar conclusions, but think about what this means. If you’re building an AI application, you don’t have to tie yourself to OpenAI. Smaller open models can do the job, and they’ll shield you from OpenAI’s bills (and inevitable price increases), let you keep your data local, and leave you in control of your destiny.
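To put rough numbers on the size comparison above, here is a small arithmetic sketch. The GPT figure is the rumor mentioned in the text (taken here as exactly one trillion, which is an assumption, not a published number); the other two counts come from the model names.

```python
import math

# Assumption: the rumored "over a trillion" is taken as exactly 1e12 for illustration.
RUMORED_GPT_PARAMS = 1e12
QWQ_PARAMS = 32e9      # QwQ-32B-Preview
GEMMA_PARAMS = 27e9    # Gemma-2-27B


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """How many powers of ten separate two quantities."""
    return math.log10(larger / smaller)


if __name__ == "__main__":
    for name, params in [("QwQ", QWQ_PARAMS), ("Gemma", GEMMA_PARAMS)]:
        ratio = RUMORED_GPT_PARAMS / params
        gap = orders_of_magnitude(RUMORED_GPT_PARAMS, params)
        print(f"{name}: about {ratio:.0f}x smaller, {gap:.2f} orders of magnitude")
```

Under that assumption, both local models come out roughly one and a half orders of magnitude smaller; a larger true parameter count for GPT would widen the gap.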
What else can we learn? I have wondered how a language model can be trained for logic; my intuition said that would be a harder and more complex problem than training it for language. My intuition was wrong. I don’t know how these models were trained, but I now think that producing logic successfully is, in many ways, simpler than generating language. Why? QwQ’s verbosity gives us a big hint: “Let’s consider the contrapositive of this statement.” A contrapositive is simply a logical pattern: If A implies B, then not B implies not A. What other logical patterns can we think of? Syllogisms: If A implies B and B implies C, then A implies C. Proof by contradiction: To prove that A implies B, assume A and not B, and show that this assumption leads to a contradiction. Induction: Show that if A(n) implies B(n), then A(n+1) implies B(n+1); then show that A(0) implies B(0).
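The propositional patterns listed above are mechanical enough to verify by brute force. The following sketch checks each one against every truth assignment; it is only an illustration of how regular these patterns are, with helper names invented for the example, and it says nothing about how any model was actually trained. Induction ranges over all natural numbers, so it cannot be checked with a finite truth table and is left out.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication: 'p implies q' is false only when p is true and q is false."""
    return (not p) or q


def is_tautology(formula, arity: int) -> bool:
    """True if the formula holds for every assignment of truth values to its variables."""
    return all(formula(*values) for values in product([False, True], repeat=arity))


def contrapositive(a: bool, b: bool) -> bool:
    # (A implies B) is equivalent to (not B implies not A).
    return implies(a, b) == implies(not b, not a)


def syllogism(a: bool, b: bool, c: bool) -> bool:
    # If A implies B and B implies C, then A implies C.
    return implies(implies(a, b) and implies(b, c), implies(a, c))


def contradiction(a: bool, b: bool) -> bool:
    # (A implies B) is equivalent to "A and not B cannot both hold".
    return implies(a, b) == (not (a and not b))


def converse(a: bool, b: bool) -> bool:
    # A tempting but invalid pattern: (A implies B) is NOT equivalent to (B implies A).
    return implies(a, b) == implies(b, a)


if __name__ == "__main__":
    print("contrapositive:", is_tautology(contrapositive, 2))
    print("syllogism:     ", is_tautology(syllogism, 3))
    print("contradiction: ", is_tautology(contradiction, 2))
    print("converse:      ", is_tautology(converse, 2))
```

The first three come out as tautologies and the converse does not, which is the difference between a valid pattern and one that merely looks like it.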
It would be easy to grow a much longer list of patterns. There are better notations to represent these patterns, but a longer list and better representations aren’t important here. What’s important is to realize that these are patterns—and that composing logical patterns into logical statements or proofs isn’t fundamentally different from composing words (or tokens) into sentences. Is pushing patterns around the essence of logic? That’s not a fair question: It’s logic if you do it correctly, illogic if you don’t. The logic isn’t in the patterns but in knowing how to assemble the patterns to solve problems—and the process of assembling patterns has to be the focus of training, looking at millions of examples of logical reasoning to model the way patterns are assembled into wholes. Any of these logical patterns can lead you astray if you’re not careful; it’s easy to construct false syllogisms by starting with premises that are incorrect. I don’t expect logic to cure the problem of hallucination. But I suspect that training a model in logical patterns is a better way for the model to “learn” logic than simply training it on words (human utterances). That’s the bet that OpenAI, Alibaba, and possibly Google are making—and they seem to be winning.
Can we go further? Are there other kinds of patterns that language models could be trained on? Yes. Generative AI has proven useful for generating code but hasn’t (yet) made significant inroads into software design. Could training models specifically on design patterns be a breakthrough?³ I don’t know, but I’d like to see someone try. A model specialized for software design would be worth having.
Could we do better with generative music if we trained models on the patterns analyzed in music theory, in addition to audio? Applications like Suno are a lot of fun, but when you get down to it, they’re just repeating the clichés of common musical styles. Would it help to give Suno some knowledge of music theory, knowledge of the patterns behind music in addition to the music itself? Would language models write better poetry if they were trained on the patterns found in poetic language (rhetorical devices, figurative speech) rather than just words? One of my first experiments with generative AI was to ask GPT-3 to write a Petrarchan sonnet, which has a different structure from the more common Shakespearean sonnet. GPT-3 and its contemporaries failed. It was a long time before I found a model that could do that successfully; although most models could define a Petrarchan sonnet, they could only generate Shakespearean sonnets. That generation of models was trained only on the words, not the larger patterns.
Is this a way forward for AI? I don’t know, but I’d like to see AI researchers try. In the meantime, though, it’s enough to realize that, powerful as the GPT models are, you can run small open models on a laptop or a phone that perform equally well.
Footnotes
- I tested on the Preview, which has now been promoted to GPT-4 o1. I did not retest with the final o1, which presumably has had further training and optimization.
- I later decided that it was unfair to omit Claude. So I tried Claude’s small Haiku model (8B parameters) and got a correct proof. Of the group, it was the most terse and required the most thought to see that it was correct.
- Design patterns are generally associated with object-oriented design, but the concept is really more general. Design patterns attempt to name solutions to problems that you see every day; naming the solution allows you to talk about it. That definition is applicable to any discipline, including functional programming and (of course) architecture.