At the end of this essay to CS grad students, Kristopher Micinski makes an interesting observation:

there are plenty of domains where Claude Code completely fails right now–paren matching in Racket is only one example. 

I had noticed this exact phenomenon many times: it would struggle to balance parentheses and get stuck for quite a while. It baffled me. After all, how hard is balancing parentheses?

When I posed this question to Gemini, this is the response I got:

The Issue: Most training data (GitHub, etc.) contains code with shallow nesting (3–5 levels).
The Result: If a Racket function requires 8 or 10 levels of nesting, the model enters "out-of-distribution" territory.

Also, these models don't use a stack and don't have a reliable way to count.
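
For contrast, here is roughly what "using a stack" looks like in ordinary code. This is a minimal sketch in Python, not how any particular editor or tool does it: `check_balance` is a name I made up for illustration, and it deliberately ignores strings, comments, and Racket character literals such as `#\(`, which a real checker would have to skip.

```python
# Closing bracket -> the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def check_balance(source: str) -> tuple[bool, int]:
    """Return (balanced, max_depth) for the brackets in `source`.

    Simplifying assumption: every bracket character counts, even inside
    strings or comments.
    """
    stack = []      # currently open brackets, innermost last
    max_depth = 0   # deepest nesting level seen so far
    for ch in source:
        if ch in OPENERS:
            stack.append(ch)
            max_depth = max(max_depth, len(stack))
        elif ch in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[ch]:
                return False, max_depth
    # Anything left on the stack was opened but never closed.
    return not stack, max_depth


print(check_balance("(define (f x) (+ x 1))"))  # (True, 2)
print(check_balance("(define (f x) (+ x 1)"))   # (False, 2)
```

The stack gives exact bookkeeping at any depth, so 10 levels of nesting are no harder than 3. That is the part a model generating text one token at a time has to approximate.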

Hopefully, going forward, projects like Calva Backseat Driver will solve this problem.