ch0ic3 2 days ago

I screenshotted the question and pasted it in Claude 3.5 Sonnet. It answered:

To determine the output of the given function for an input of 3, let's analyze the code step by step:

The function is called fizzbuzz(n) and we're considering n = 3. The return statement uses a conditional expression:

  return (["Fizz","Buzz","FizzBuzz"])[ (n % 3 === 0) + (n % 5 === 0) ];

Let's break down the conditions:

- (n % 3 === 0) evaluates to true (1) because 3 is divisible by 3

- (n % 5 === 0) evaluates to false (0) because 3 is not divisible by 5

Adding these boolean values: 1 + 0 = 1. This result (1) is used as an index to select from the array ["Fizz", "Buzz", "FizzBuzz"]. Index 1 in this array corresponds to "Buzz".

Therefore, for an input of 3, the function will return "Buzz". This might seem counterintuitive since typically in a FizzBuzz problem, numbers divisible by 3 return "Fizz". However, the indexing in this implementation leads to this unexpected result. Would you like me to explain any part of this analysis further?
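
For anyone who wants to check the indexing themselves, here is a rough Go translation of the quoted JavaScript (my own sketch, not from the original post; the helper b2i makes explicit the boolean-to-number coercion the JS version relies on):

  // Rough Go translation of the quoted JavaScript fizzbuzz.
  func fizzbuzz(n int) string {
      // b2i stands in for JS implicitly treating true/false as 1/0.
      b2i := func(b bool) int {
          if b {
              return 1
          }
          return 0
      }
      words := []string{"Fizz", "Buzz", "FizzBuzz"}
      return words[b2i(n%3 == 0)+b2i(n%5 == 0)]
  }

fizzbuzz(3) indexes the array at 1 + 0 = 1 and returns "Buzz", matching the model's answer; an input divisible by neither, such as 7, indexes 0 and returns "Fizz", so the quoted function is not a correct FizzBuzz in any case.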

  • rep_lodsb a day ago

    That's kind of impressive, being able to reason through it like this. However, I think that its normal "thought process" - and resulting limitations - is well illustrated by the remark about "the typical FizzBuzz problem":

    It recognizes patterns that already exist in training data. Seeing "Fizz", "Buzz", and the modulo operator would naturally make it jump to a wrong conclusion, unless prompted to go through it step by step. Most humans wouldn't make this mistake, but then they probably also haven't memorized the "canonical" FizzBuzz example, so who can say if they are smarter or just forced to reason more carefully ;)

    But even the step-by-step process depends on having seen enough similar examples before, and would likely be prone to confusion by more unusual code that only sort-of-matches a common pattern, perhaps with misleading comments. I wonder if any current LLM can correctly explain what this does:

        ; Example 1 - Print "Hello world" on Linux, x86-64
    
        ; Assemble : nasm -felf64 hello.asm
        ; Link     : ld -o hello hello.o
    
        section .data
    
        Foo:        db "The quick brown fox jumps over the lazy dog",10
        Bar:        db 17,42,27,36,3,41,20,39,13,6,44,0
     
        section .text
        default rel
    
        global _start
        _start:
            lea     rdi,Bar                 ;load pointer to buffer
     
        L1: movzx   ecx,byte [rdi]          ;get input
            jrcxz   L2                      ;exit on null byte
            lea     rsi,Foo                 ;load pointer to string
            rep     lodsb                   ;convert to ASCII
            stosb                           ;store output
            jmp     L1
    
        L2: mov     eax,1                   ;write
            lea     rsi,Bar                 ;buffer
            mov     rdx,rdi                 ;length = rdi - buffer
            sub     rdx,rsi
            mov     edi,1                   ;handle = stdout
            syscall
    
            mov     eax,60                  ;exit
            xor     edi,edi                 ;status
            syscall
    
    
    (stnemmoc eht ni detats egassem eht ni tluser ton seod hcihw ,"ooF" ni noitisop retcarahc gnidnopserroc eht ot "raB" ni etyb hcae setalsnart ti taht si rewsna ehT)
    • Twey a day ago

      My favourite litmus test for ‘can LLMs reason about code?’ is to make up a programming language with familiar syntax but weird semantics. E.g.:

      - all variables contain signed integers

      - all variable names have block scope

      - there is no variable declaration syntax: all variables are implicitly initialized at first use with the value 5

      - all integer literals are expressions

      - the expression `a + b` means to subtract the value on the left from the variable on the right, returning the previous value of the variable

      - a program is a block

      - a block is a sequence of statements enclosed in braces and separated by semicolons, and executed from bottom to top

      - conditionals are introduced by the keyword `while`, followed by an expression, followed by a block that is executed only if the expression evaluates to 4

      - loops are done by simply prefixing a block with an expression; if the expression evaluates to 0, the block will run indefinitely, otherwise the block will run a number of times indicated by the negation of the value

      Et cetera. Then I ask the LLM to write a simple program (e.g. FizzBuzz). Even with a lot of hand-holding, I've yet to get an LLM to do this successfully, or even to answer questions about a program written in the language.

      • gaganyaan 14 hours ago

        I actually had pretty good results from taking a new language that was posted here and having GPT-4 try to interpret it. I don't remember what it was called, but it was APL-like, very symbol-dense, but not using standard symbols. It was too new to be included in any training data at the time, but GPT-4 did a good job of figuring out what each symbol meant.

        I think it's not impossible for LLMs to write the kind of code you're describing. Maybe it's actually harder when common idioms are redefined, but to be fair that happens with people too:

        https://en.m.wikipedia.org/wiki/Stroop_effect

      • throwuxiytayq a day ago

        My favorite test is to ask the LLM to approximate the mental processes going on in my brain and, based on that, divine what food I had for dinner last Thursday. /s

        I’m honestly quite tired of reading people’s favorite ways to break the LLM, like it’s some kind of an achievement. Always in the context of “See? It doesn’t really reason/know/understand X!”.

        Yes, it breaks when asked to do complicated stuff. GPT-4 was worse at it than o1, GPT-3 broke on trivial queries, and GPT-2 couldn’t get anything done. I don’t even interact with LLMs often, and I find this whole topic to be breathlessly obvious, boring and unproductive, and yet every single conversation about LLMs devolves into it. Sorry about the rant, but it needed to come out at some point.

    • lolsowrong a day ago
      • rep_lodsb 21 hours ago

        I admit defeat and welcome our AI overlords :)

        • lolsowrong 20 hours ago

          I was quite impressed as well.

          • dTal 13 hours ago

            It's slightly less (or more?) impressive when you see that it thought for 45 seconds (!), and basically hand-executed the code to see what it did. I'd love to know how many tokens that actually took. Worth remembering that LLMs can be viewed as term rewrite engines, and as such can compute anything given enough space.

positus a day ago

Here’s o1’s answer:

The function `fizzbuzz(n)` returns a value based on whether `n` is divisible by 3, 5, both, or neither. Let's break down the function step by step for an input of `n = 3`:

1. Array Initialization:

  ["Fizz", "Buzz", "FizzBuzz"]

   This array has indices 0, 1, and 2 corresponding to "Fizz", "Buzz", and "FizzBuzz" respectively.
2. Condition Checks:

   - `n % 3 === 0` checks if `n` is divisible by 3.

     - For `n = 3`, this is `true` (which equates to `1` when used in arithmetic).

   - `n % 5 === 0` checks if `n` is divisible by 5.

     - For `n = 3`, this is `false` (which equates to `0`).
3. Index Calculation:

  (n % 3 === 0) + (n % 5 === 0)

   - For `n = 3`, this becomes `1 + 0 = 1`.
4. Array Access:

  ["Fizz", "Buzz", "FizzBuzz"][1]

   - This accesses the element at index `1`, which is `"Buzz"`.
Conclusion:

For an input of `3`, the function returns `"Buzz"`.

Answer:

"Buzz"

aster0id 2 days ago

I use various models through Openrouter API + Open WebUI with a custom step by step prompt. Here's the output using the Qwen2.5 72B model:

To determine the output of the given `fizzbuzz` function for an input of 3, let's break down the code step by step:

1. *Function Definition*:

   ```javascript
   function fizzbuzz(n) {
       return (["Fizz","Buzz","FizzBuzz"])[ (n % 3 === 0) + (n % 5 === 0) ];
   }
   ```

2. *Input*:
   - The input `n` is 3.

3. *Evaluate the Conditions*:
   - `n % 3 === 0`: This checks if `n` is divisible by 3.
     - For `n = 3`, `3 % 3 === 0` is `true`.
     - Therefore, `(n % 3 === 0)` evaluates to `1` (since `true` is treated as `1` in arithmetic operations).
   - `n % 5 === 0`: This checks if `n` is divisible by 5.
     - For `n = 3`, `3 % 5 === 0` is `false`.
     - Therefore, `(n % 5 === 0)` evaluates to `0` (since `false` is treated as `0` in arithmetic operations).

4. *Sum the Results*:
   - The expression `(n % 3 === 0) + (n % 5 === 0)` becomes `1 + 0`, which equals `1`.

5. *Array Indexing*:
   - The array `["Fizz", "Buzz", "FizzBuzz"]` is indexed by the result of the sum.
   - The index `1` corresponds to the second element of the array, which is `"Buzz"`.

6. *Return Value*:
   - The function returns `"Buzz"`.

Therefore, the output of the function `fizzbuzz(3)` is `"Buzz"`.

### Reflection and Error Checking

- *Step-by-Step Verification*:
  - The conditions `n % 3 === 0` and `n % 5 === 0` are correctly evaluated.
  - The sum of the boolean results is correctly calculated.
  - The array indexing is correctly performed.
- *Logical and Factual Accuracy*:
  - The logic and arithmetic operations are correct.
  - The array indexing is consistent with the conditions.

No errors were found in the reasoning or the final answer. The output of the function `fizzbuzz(3)` is indeed `"Buzz"`.

JPLeRouzic a day ago

Yesterday, without being asked, ChatGPT caught a bug in a piece of code I had sent it for another reason.

These kinds of tools might not pass ad hoc, extremely difficult tests, but they are of great use to me and probably to many other people.

njtransit a day ago

A much better example, in my mind, is binary addition. Conceptually, binary addition is very easy to explain (there are two sequences of bits, we want to combine them pairwise, etc.), yet it’s trivially easy to double the problem space by adding a single bit. o1 can add up to 10 bits, which would be trivially easy to memoize in its billions of parameters.
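
To make the "easy to explain" part concrete, here is a minimal ripple-carry sketch (my own illustration, with bits given least-significant first); the whole spec fits in a few lines, yet every extra input bit doubles the number of distinct cases a lookup table would have to memorize:

  // Add two binary numbers given as bit slices, least-significant bit
  // first, and return the sum in the same format.
  func addBits(a, b []int) []int {
      n := len(a)
      if len(b) > n {
          n = len(b)
      }
      sum := make([]int, 0, n+1)
      carry := 0
      for i := 0; i < n; i++ {
          bit := carry
          if i < len(a) {
              bit += a[i]
          }
          if i < len(b) {
              bit += b[i]
          }
          sum = append(sum, bit&1) // current output bit
          carry = bit >> 1         // carry into the next position
      }
      if carry != 0 {
          sum = append(sum, carry)
      }
      return sum
  }

For example, addBits([]int{1, 1}, []int{1}) computes 3 + 1 and returns [0 0 1], i.e. 4.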

It really makes you realize that, while LLMs appear smart, they fundamentally lack the ability to handle more than a few independent pieces of information despite their massive size.

atomic128 a day ago

Too easy.

But can your LLM understand this Go code and explain to you how it works?

  func extract(x, m uint64) uint64 {
      x &= m
      var (
          holes = ^m
          shift = uint(1)
      )
      for ; shift < 64; shift *= 2 {
          odd := holes
          odd ^= odd << 1
          odd ^= odd << 2
          odd ^= odd << 4
          odd ^= odd << 8
          odd ^= odd << 16
          odd ^= odd << 32
          holes &^= odd
          move := x & odd
          x ^= move
          x |= move >> shift
      }
      return x
  }

What does this compute, and how is that computation achieved?

More broadly, I'm curious to know how much of the BUGFIX-66 debugging gauntlet your LLM can solve.

You have to Google it, because I'm not allowed to link to it on Hacker News. Banned.

  • gaganyaan 9 hours ago

    It also did a step by step breakdown, but here's what GPT-4o summarized the function as:

    What does the code achieve?

    This function essentially extracts and compacts the bits from x that correspond to the 1 bits in the mask m. It does so by successively shifting the extracted bits towards the least significant positions, while keeping the relative order of the bits intact.

    The algorithm is likely inspired by techniques used in bit packing or bit extraction, where the goal is to gather specific bits from a number based on a mask and pack them together efficiently.
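
    To pin down what that summary is claiming, here is a naive reference for the described behavior (my own sketch, walking the mask bit by bit rather than using the parallel-suffix trick in the quoted code):

      // For every 1-bit of m, copy the corresponding bit of x into the
      // next free low position of the result, preserving relative order.
      func extractNaive(x, m uint64) uint64 {
          var out uint64
          j := uint(0)
          for i := uint(0); i < 64; i++ {
              if m&(1<<i) != 0 {
                  out |= ((x >> i) & 1) << j
                  j++
              }
          }
          return out
      }

    If the quoted function really implements what GPT-4o describes, the two should agree on every input; for instance extractNaive(0xF0, 0x55) gathers the selected bits into 0xC.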

    • atomic128 8 hours ago

      Right, so it's trained on this text, from the BUGFIX-66 page where that code appears:

        Bug 47: Bit Extraction Misguided By Mask

        This is Guy L. Steele Jr.'s parallel suffix algorithm for bit extraction (U.S. Patent 6715066).
        Bits of x that correspond to 1-bits in m are shifted low and made consecutive, preserving relative order.
        The method masks x with m to isolate such bits and then repeatedly shifts 1-bits of x low by powers of two.
        A 1-bit in x with n less significant 0-bits in m is shifted low by a series of powers of two that sum to n.
      
      And it's trained on the patent, too. But there's something very clever happening in the repeated prefix sums of the holes, and that's the magic that makes it work.

      Does it understand how repetition of this works to achieve the extraction?

        odd := holes
        odd ^= odd << 1
        odd ^= odd << 2
        odd ^= odd << 4
        odd ^= odd << 8
        odd ^= odd << 16
        odd ^= odd << 32
        holes &^= odd
      
      What's happening to holes when this piece of code is repeated over and over?

      Does it really understand the algorithm (i.e., the magic of the code sequence above) or is it just recognizing the "bit extraction code" from its training data, and executing it like a machine?

      That's my question.

  • emptiestplace a day ago

    4o did a great job at this, both in terms of actually understanding it and explaining in a way that is very easy to follow. I suspect Claude would as well.

    I pasted it here but the formatting got fucked.

    Have you ever used an LLM?

  • botanical76 a day ago

    What is your point? Does anyone really believe LLMs are capable of understanding such dense, undocumented, and advanced code today? Or is the state of AI today - specifically on this nontrivial task - supposed to predict that it never will be?

    • AnimalMuppet a day ago

      First: Undocumented? So what? Either it can understand the code, or it can't. (And it better not depend on the comments, same as a programmer shouldn't, because they can be lies.)

      You think it's unreasonable to expect it to understand 21 lines? Of Go, one of the least dense languages out there?

      I think it actually demonstrates the point: (some) LLMs really cannot read and understand code if it doesn't fit in with their training data. They don't know the language, they know the patterns they have seen. If it doesn't fit, they can't reason about it from a knowledge of the language syntax, because they don't actually have such knowledge.

DannyBee a day ago

This whole silly post seems to have been written so they can make this statement, which they seem to believe is profound:

"This failure is not surprising, because any machine learning model, no matter how many magic parameters it has, is only capable of ex ecuting a fixed pipeline of mathematical operations. In algorithmic complexity terms it is O(1). Such a process can not solve any problem that requires iteration or search, not even a simple algorithm like find ing a square root. Certainly it can't tackle the complex logic involved in answering queries about code, where even determining the behav ior of a pure boolean expression is known to be in the much harder NP-complete complexity class."

It is neither profound nor even close to correct.

A CPU capable of only executing a fixed pipeline of mathematical operations can easily deal with all the things they talk about. Amazing? No, it just turns out they have confused a whole bunch of unrelated CS concepts to make a huge muddle of a post (mixing algorithmic complexity, decidability, simulation vs explanation, etc).

This is true even if the pipeline is truly fixed (i.e., everything passes through every functional unit), as it was in the past.

The counterpoint to the argument made by the author is simple:

When it comes to determining output, any NN that is universal enough could be taught to arbitrarily approximate a CPU well enough to execute code and determine the output with arbitrary levels of precision. Period. Once you take cost out of it, and we venture into the realm of "what can be done", the author's statement is simply wrong on its face. Even within the realm of the "less abstract", there are lots of formal papers/studies of this with specific NN types, etc. Even for transformers, which people love to beat up on as huge brute force machines, see, e.g., https://arxiv.org/abs/2107.13163

You can even throw in compilation of every programming language we currently have compilers for as a pre-step if you want, since we are in the realm of the abstract "what is possible".

Heck, you can go further - any of these NNs could be taught to arbitrarily approximate any machine-driven process that can explain code.

That is, if you build a combination static/dynamic analysis tool (or whatever) that explains code, you can theoretically build an NN that approximates that tool. This is easily provable, and directly counter to the author's point.

Now, much like the author's nonsense, neither of the above is a particularly useful result in practice[1] - among other reasons, we can prove that it is possible, but the proofs are not constructive; they only show that such a network must exist, they don't tell you how to make it :)

That's before you get into cost/etc.

There are plenty of things to bash AI code understanding on that are totally valid criticisms, we don't need to waste our time on strange, wrong ones.

[1] - though nvidia would probably be very happy if you build a 1t parameter model to try to approximate a $100 intel CPU's results and convince others to use it.

  • literalAardvark a day ago

    The irony is that even gpt4o can tear down the author's logic.

    ---

    Let's break down the algorithmic reasoning presented in the post:

    1. Claim that LLMs Operate at O(1) Complexity:

    This is misleading. While LLMs use fixed operations during inference, their actual time complexity depends on the input length (typically O(n) where n is the input size). More importantly, LLMs process text sequentially, meaning they are not strictly O(1).

    2. Mischaracterization of Boolean Logic:

    Boolean expression evaluation is not NP-complete; it's typically O(1) for simple expressions or O(n) for complex nested conditions. NP-complete problems are vastly more difficult, so the comparison is incorrect.

    3. Iterative Problem-Solving:

    While LLMs don't natively "iterate" in the traditional algorithmic sense, they can solve iterative problems through recursion or by generating looping constructs when writing code. The model's failure in the example is not due to an inherent limitation on iteration.

    4. Inability to Debug or Modify Code:

    This claim overlooks that LLMs can assist in debugging and modifying code, though with varying accuracy. They can perform simple fixes or refactor code based on context, though they're less reliable for complex debugging.

    In conclusion, while the post raises valid concerns about LLMs' current limitations, some of its algorithmic claims are flawed or oversimplified.

  • drdeca a day ago

    Hasn’t it been shown that the complexity class of a transformer outputting a single token is TC^0 ? So, you can’t really expect it to directly output what some input code would output without allowing it extra computation steps by letting it output some other tokens first?

    Edit: err… I guess what has actually been shown is for a variation on transformers where the attention attends only to the tokens that would be tied for first, evenly distributed among those, rather than distributed more generally among all the previous tokens. But I wouldn’t be surprised if the same is true with normal attention.

    • DannyBee a day ago

      Sure, there are models and variations whose power is not good enough to approximate the problems we are talking about here - most of them trading speed for power on purpose because of the enormous costs involved.

      This is actually one of the reasons I think the author's claim is so silly. They don't claim it will never be practical; they claim it will never be possible.

      For the purposes of this discussion, I don't claim you can make a practical LLM that understands code or simulates it. Only that you can make one at all :)

      I actually do think you can make a practical and useful one for a simple reason - while all the edge cases are fun to come up with, and the amount/kinds of code one could write is clearly infinite, code is closer to literary works or musical compositions in how that plays out - daily WTF aside, we don't actually write all possible code, and even to the degree you do, you don't really have to handle it all to be practical or useful.

      I.e., the set of music that you can create and books you can write is similarly infinite. However, the set of music that "sounds right" (go with me) and books that "read right" is not anywhere near as huge in practice. This isn't just semantics/syntax.

      Code is similar - while the possibilities are infinite, you don't need to handle every possible thing to be very useful - you just need to be able to handle some part of the spectrum from "world's shittiest or most broken code" to "world's highest quality or most elegant code".

      To me, this is a much more tractable approximation problem than "handle anything anyone ever throws at you"

      The same way I think it's a more tractable approximation problem to say "make something that sounds good to the average pop music listener" vs "make an original and unique composition of music that sounds good to everyone who hears it" or whatever.

      (I'm not being super precise, but i expect you get the point :P)

    • dartos a day ago

      > So, you can’t really expect it to directly output what some input code would output without allowing it extra computation steps by letting it output some other tokens first?

      These commercial LLMs are trained to be conversational. That’s why they don’t just output the code’s result.

      ChatGPT and Claude also randomize the seed for each convo (or each message, not sure) and include things like the current date in the prompt.

      It’s very difficult to tell if something is reproducible or just a lucky (or unlucky) output.

      • drdeca 18 hours ago

        They may do that for that reason, and it might also happen to help them produce the right output. My point is just that (aiui), for any transformer model, holding the network constant but increasing the length of the input code, at some point they will be unable to produce the output of the code without preceding it with other output tokens first.

        Of course, this statement is kind of trivial as applied to current transformer models because they have a limited context length anyway (they don’t properly generalize to ~arbitrary context lengths, but I imagine there is a way to fix that… idk how though).

        And, also, “output the tokens for the output of the code immediately, without other tokens first” is I think a bit restrictive. That they are limited in this way when required to produce the output immediately is, to my mind, not that significant of a limitation. Just wrap the model in a simpler (designed by hand) program which has the model produce whatever tokens it needs to produce to get the answer, and then extract the answer from that and yield that as the output.
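
        A sketch of that wrapper idea, where Model and Generate are purely hypothetical stand-ins for whatever API returns the model's text (nothing here is a real library call; only the standard strings package is used):

          // Hypothetical interface for "the model produces some text".
          type Model interface {
              Generate(prompt string) string
          }

          // Let the model emit whatever scratch tokens it likes, then keep
          // only what follows the final "ANSWER:" marker.
          func runCode(m Model, code string) string {
              prompt := "Work through this program step by step, then end with ANSWER: <its output>.\n\n" + code
              transcript := m.Generate(prompt)
              if i := strings.LastIndex(transcript, "ANSWER:"); i >= 0 {
                  return strings.TrimSpace(transcript[i+len("ANSWER:"):])
              }
              return "" // the model never produced a final answer
          }

        The scratch tokens are exactly the extra computation steps being discussed; only the extracted tail counts as the answer.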

        (Though, while it seems likely to me that there is a possible transformer model that would be capable of serving this role inside such a wrapper, obtaining such a model through training seems more complicated.)

  • octopod12 21 hours ago

    A Turing machine is a fixed pipeline too, if you define the "pipeline" as its finite set of symbols and states.

    And I agree that a folksy-sounding title usually hides fuzzy thinking.

    But the interesting point made is whether an LLM can figure out if it has "actually" solved a problem or not. Technically, this is the classical halting problem.

    • DannyBee 21 hours ago

      I'm not sure the author makes that point usefully, though it is an interesting point to discuss.

      I agree this is the classical halting problem, and an LLM can't answer it if its formalism is only as powerful as a Turing machine.

      Even if you get past that with more powerful formalisms, you will eventually hit the incompleteness theorem with your formalism.

      But we can't solve those problems either, so let's put that aside.

      You can get very, very far with Turing machines. Busy beaver programs have demonstrated this:

      1. An 8000-state (now down to something like 765) Turing machine that can't be proven to halt if ZFC is consistent

      (this is worse than the halting problem :P)

      2. A 4888-state Turing machine that halts only if there is a counterexample to Goldbach's conjecture

      3. A 5372-state Turing machine that halts only if the Riemann hypothesis is false

      Can an LLM say anything about these? Ever? Maybe the first one; maybe not the other two?

      I don't think anything the author goes on about in their post helps figure out if the answer is yes or no.

maebert a day ago

This feels like a sloppier version of Searle’s Chinese Room.

Are iterative thinking and search required to understand code? And even if LLMs don't do that, why couldn't a complex system like o1 be able to?

I was not convinced by Searle, and this is not convincing either.

  • DannyBee a day ago

    It's worse than a sloppier version, because we've already proven it wrong in the literal sense.

    The simulation-of-code part is trivially wrong - early CPUs were clearly fixed-function pipelines, and they could execute programs (including fizzbuzz) just fine! The author seems to confuse the algorithmic complexity of the available operations with the power of a particular approximation model. They are mostly unrelated.

    The only thing you require to simulate Turing machines, for example, is a simple O(1) NAND gate (and an arbitrary amount of constant-time memory; or equivalently, SKI calculus if you hate arbitrary memory). This is because the algorithmic complexity of the operations that make up the simulator is (mostly) unrelated to the power of the simulator. It only changes the time required to execute the simulation.
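
    To make the "simple O(1) primitive" point concrete, here is the textbook functional-completeness sketch (my own illustration, not from the post): every Boolean function a CPU's combinational logic computes can be built by composing NAND with itself, so the power comes from composition and state, not from the complexity of any single operation.

      // Everything below is composed from nand alone.
      func nand(a, b bool) bool { return !(a && b) }
      func not(a bool) bool     { return nand(a, a) }
      func and(a, b bool) bool  { return not(nand(a, b)) }
      func or(a, b bool) bool   { return nand(not(a), not(b)) }
      func xor(a, b bool) bool  { return and(or(a, b), nand(a, b)) }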

    As another example, you can simulate any non-deterministic Turing machine with a deterministic one. The models have equivalent power. The open question is "how fast can you do it", not "can you do it".

    Similarly, we can already prove it's possible to build LLMs that can simulate arbitrary CPUs to arbitrary precision - the proofs, of course, are not constructive, so they don't help you build one.

    On the explaining code front, it doesn't take a huge leap to see that any process you are running on a computer today to explain code could similarly be approximated by a neural network.

    In practice, this is just as useless as the author's sloppy statements, but the author purports to make claims about what is possible in theory, not about what is practical.