Solving 6/10 days of Advent of Code with GPT-4

Each Christmas, some friends and I have a go at solving Advent of Code (AOC), a well-thought-out set of 25 Santa-themed programming problems.

Advent of Code has 25 days (one for each day of an advent calendar!). Each day has two parts: part 1 and an extension, part 2. The problems escalate in difficulty as the days go on.

I'd previously attempted to solve days 1-5 with ChatGPT (GPT-3.5), with mixed results. On Friday night, I got access to the GPT-4 APIs and decided to give it another go.

The Rules

I wanted to write a program that solves an Advent of Code problem like a human.

It takes two inputs:

  • problem.txt - The problem description. Copied and pasted from the AOC website and otherwise unedited.
  • puzzleInput.txt - The puzzle input for the problem.

The program also has access to:

  • solution.txt - The solution to the problem. The solution is never sent to GPT and is only used to check the output of the program. No information about 'closeness' of the answer is sent to GPT.

I wanted the program to be generic and not tweaked for each specific problem.

GPT's solution is run inside Deno like this:

deno run --allow-read=src/adventOfCode/days/${day} src/adventOfCode/days/${day}/index.ts

This should (hopefully) prevent GPT from rm -rf /ing my machine, or making network requests and starting to make too many paperclips.
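For the curious, here's a minimal sketch of how a harness might spawn that sandboxed Deno process and capture its output (an illustration, not the repo's exact code; the timeout is an assumption):

import { spawnSync } from 'node:child_process'

const runSolution = (day: string) => {
  const dir = `src/adventOfCode/days/${day}`
  // --allow-read is scoped to the solution's own directory: no writes,
  // no network access, no reading the rest of the filesystem.
  return spawnSync('deno', ['run', `--allow-read=${dir}`, `${dir}/index.ts`], {
    encoding: 'utf8',
    timeout: 60_000, // assumption: kill runaway solutions after a minute
  })
}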

Strategy 1 - Hole in one

This strategy is simple: We present GPT with the whole problem, and ask it to solve it.

We then run the program it produces and check the output against the solution.

Let's run it on day 1 part 1:

pnpm exec tsx src/solve.ts \
  --day 01.1 \
  --model gpt-3.5-turbo \
  --solve-prompt solve/noTemplate/2

=> Day 01.1 result: ✅   Attempts: 1  Time Taken (secs): 14   GPT Requests: 1   Cost: $0.002012

14 seconds and $0.002 to solve day 1 part 1. Not bad! This would have ranked you at the top of the Advent of Code leaderboards in a pre-GPT world.

The prompt (the text we send to GPT)

The prompt is simple:

Here is a coding problem I am trying to solve:

{{{problem}}}

You must solve {{partDescription}} the problem using Deno and Typescript.

You can read files in Deno like this:

import * as path from 'https://deno.land/std@0.101.0/path/mod.ts'

const readRelativeInput = (importUrl, inputFile) => {
  const dirname = path.dirname(path.fromFileUrl(importUrl))
  const filePath = path.join(dirname, 'data', inputFile)
  return Deno.readTextFileSync(filePath)
}

The puzzle input is stored at ./data/puzzleInput.txt

The last thing your solution should do is print the answer to the puzzle on a new line like this: console.log('\n', answer)

{{{partDescription}}} expands so that the prompt reads "You must solve part 2 of the problem..." when it's a part 2 question.
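The {{{...}}} syntax suggests a Handlebars-style template. Filling in the placeholders might look something like this (an assumption about the implementation; the template path is hypothetical):

import Handlebars from 'handlebars'
import * as fs from 'node:fs'

const promptSource = fs.readFileSync('prompts/solve/noTemplate/2.hbs', 'utf8') // hypothetical path
const template = Handlebars.compile(promptSource)

const prompt = template({
  problem: fs.readFileSync('problem.txt', 'utf8'),
  partDescription: 'part 2 of', // or 'part 1 of' for part 1 questions
})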

You can probably spot that this wasn't my first iteration of this prompt. I had to steer GPT in a couple of places:

  • GPT was having problems reading files in Deno (I suspect because Deno was still quite new around its summer 2021 knowledge cutoff). So we give it a little help.

  • console.log('\n', answer) was added because sometimes GPT decides to print something like "Solution: 1234", which was annoying to parse. So we ask it to print the answer on a new line (see the sketch below).
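With the answer on its own line, parsing and checking it is simple. A minimal sketch (the function names here are mine, not the repo's):

const parseAnswer = (stdout: string): string => {
  // The prompt asks for console.log('\n', answer), so the answer should be
  // the last non-empty line of the program's output.
  const lines = stdout.trim().split('\n')
  return lines[lines.length - 1].trim()
}

const isCorrect = (stdout: string, solution: string): boolean =>
  parseAnswer(stdout) === solution.trim()
// Only this boolean is used locally; no 'closeness' feedback is ever sent to GPT.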

Solution

Here is the output from the run.

The solution looks pretty good, and GPT writes a nice explanation of the solution along with it:

Here's one possible solution to the problem:

import * as path from 'https://deno.land/std/path/mod.ts'

const readRelativeInput = (importUrl: string, inputFile: string) => {
  const dirname = path.dirname(path.fromFileUrl(importUrl))
  const filePath = path.join(dirname, 'data', inputFile)
  return Deno.readTextFileSync(filePath)
}

const input = readRelativeInput(import.meta.url, 'puzzleInput.txt')

const elfCalories = input.trim().split('\n\n').map((elfInput) => {
  const items = elfInput.trim().split('\n').map(Number)
  const totalCalories = items.reduce((sum, calories) => sum + calories, 0)
  return totalCalories
})

const maxCalories = Math.max(...elfCalories)

console.log('\n', maxCalories)

First, we read the puzzle input file using the readRelativeInput function. We then split the input into groups of lines separated by blank lines using split('\n\n'). For each group, we split the lines into numbers using map(Number) and calculate the total calories using reduce((sum, calories) => sum + calories, 0). We store the total calories for each elf in an array called elfCalories.

Finally, we find the maximum value in elfCalories using Math.max(...elfCalories) and print it to the console.

Note that we don't need to sort the input or keep track of which elf has the maximum calories, since we only care about the maximum value itself.

You'll notice GPT has written a mix of text and a code block with the actual solution. This meant I had to find the code blocks and extract the solution from them.
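A sketch of how that extraction might work, assuming GPT fences its code with triple backticks (which it usually does):

const extractCodeBlocks = (reply: string): string[] =>
  [...reply.matchAll(/```\w*\n([\s\S]*?)```/g)].map((match) => match[1].trim())

// The longest block is usually the full solution; shorter blocks tend to be fragments.
const extractSolution = (reply: string): string | undefined =>
  extractCodeBlocks(reply).sort((a, b) => b.length - a.length)[0]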

Strategy Performance

GPT-3.5: This hole-in-one strategy works for days 01.1 and 03.1. Sometimes it works for 01.2, and it doesn't seem to work for any other day.

GPT-4 does better. It can solve 01.1, 01.2, 02.1, 02.2, 03.1, 03.2, 04.1, 04.2 and 08.1.

GPT-4 finds 05.1 very hard: the code it generates seems to error or be way off. It is very close to solving 06.1, as its answers are typically only off by 1, 2 or 3.

For each solve GPT-4:

  • Takes ~24 - 40 seconds
  • Costs ~$0.04-0.07 (depending on the number of tokens in the prompt and response)

This is an impressive feat.

Imagine asking a human: "Hey could you solve this programming challenge for me, you've only got one chance to run your answer". Most programmers would fail this task, and they certainly wouldn't do it in 40 seconds for $0.07.

So, can we get it to solve days 5, 6, 7, etc.?

Strategy 2 - Debug the solution

GPT-4 is very close to solving day 6 part 1. Perhaps by letting GPT see the output of its solution, it will be able to solve it on the 2nd or 3rd run.

pnpm exec tsx src/solve.ts \
  --day 06.1 \
  --model gpt-4 \
  --solve-prompt solve/noTemplate/3 \
  --solve-failed-prompt failedSolve/5 \
  --solve-failed-attempts 3

The initial prompt now includes an extra line:

Your solution should print console.log statements to help me understand what is happening as the code runs.

This encourages the model to print out its intermediate steps.

The program then runs the solution and asks GPT to debug it:

Here is a solution that produces the wrong answer.

{{{codeBlock}}}

Here was the output from running it:

{{{solveOutput}}}

{{#stdOutUnder1000Chars}} Please add more console.log statements to help me debug. {{/stdOutUnder1000Chars}}

Please provide a full, updated solution in Deno and Typescript.

This strategy seemed pretty ineffective.

Even though GPT was off by only 1 or 2 for 06.1, it still couldn't solve it.

I saw some instances where GPT did iterate to the correct solution, but it was generally pretty rare.

Strategy 3 - Break the problem into smaller problems

Any decent engineer knows that breaking big problems into smaller problems is a good idea. So let's try getting GPT to do the same thing.

This strategy is related to chain-of-thought prompting: getting GPT to show its working by printing out its intermediate steps, which seems to help the end result.

Now we inject an extra prompt earlier in the conversation.

Here is a coding problem:

{{{problem}}}

Please explain an approach you could use to solve {{partDescription}} the problem. Explain how the solution works and break down the small steps involved.

The next prompt is the same as in strategy 1, but without mentioning the problem again (to save $!).
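To give an idea of the shape of the conversation, it might be assembled something like this (the roles follow the OpenAI chat format; the prompt text is abbreviated):

type Message = { role: 'system' | 'user' | 'assistant'; content: string }

const messages: Message[] = [
  // Pre-solve prompt: the problem plus a request to explain an approach.
  { role: 'user', content: 'Here is a coding problem: ... Please explain an approach ...' },
  // GPT's explanation stays in the conversation and informs the solve step.
  { role: 'assistant', content: "(GPT's step-by-step explanation)" },
  // The solve prompt no longer repeats the problem text, saving tokens.
  { role: 'user', content: 'You must solve part 1 of the problem using Deno and Typescript. ...' },
]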

pnpm exec tsx src/solve.ts \
  --day 06.1 \
  --model gpt-4 \
  --solve-prompt solve/noTemplateNoProblem/1.1 \
  --pre-solve-prompts chainOfThought/5

=> Day 06.1 result: ✅ Attempts: 1 Time Taken (secs): 61 GPT Requests: 2 Cost: $0.0999

GPT-4 then solves: 06.1, 06.2

The explanations and reasoning it gives are pretty good / convincing.

For 06.1 it replies:

To solve this problem, we can use a sliding window approach. The solution works by maintaining a window of size 4 and checking if all characters in the window are distinct. If they are, we have found the start-of-packet marker, and we can return the position. If not, we slide the window one character to the right and repeat the process.

Here are the steps involved in the solution:

  1. Initialize a variable position to store the current position in the datastream buffer, starting at 0.

  2. Create a function is_distinct that takes a string of length 4 as input and returns True if all characters in the string are distinct, and False otherwise. This can be done by converting the string to a set and checking if the length of the set is equal to 4.

...continues

It's interesting that just talking through the problem changes the quality of its solution.
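For comparison, here's my own minimal TypeScript sketch of the sliding-window approach GPT describes (day 6 part 1 asks for the first position after which the last four characters received are all distinct):

const findMarker = (buffer: string, windowSize = 4): number => {
  for (let i = windowSize; i <= buffer.length; i++) {
    const window = buffer.slice(i - windowSize, i)
    // All characters are distinct iff de-duplicating doesn't shrink the window.
    if (new Set(window).size === windowSize) return i
  }
  return -1
}

console.log(findMarker('bvwbjplbgvbhsrlpgdmjqwftvncz')) // 5, matching the problem's example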

Strategy 4 - Simple examples

As a human, I'll often write down a few examples to see if I can get a feel for the problem. Let's try getting GPT to do the same thing:

Here we're adding an extra prompt after the chain-of-thought prompt, but before the solve prompt, like this:

List some examples which cover various cases and their expected outputs. Explain how you would solve each one.

pnpm exec tsx src/solve.ts \
  --day 09.1 \
  --model gpt-4 \
  --solve-prompt solve/noTemplateNoProblem/1.1 \
  --pre-solve-prompts chainOfThought/5 \
  --pre-solve-prompts examples/2

=> Day 09.1 result: ✅   Attempts: 1  Time Taken (secs): 188   GPT Requests: 3   Cost: $0.24933

Solves 09.1 (for only $0.25!)

This feels like another 'chain of thought' strategy. We're making GPT generate more output which will subsequently help it solve the problem in the next prompt.

Strategy 5 - Examples + Debugging

Combining our examples strategy (strategy 4) with the debugging one (strategy 2), we can get the model to run some examples.

We add this line to the solve prompt:

Store the examples in a const examples: string[]. Your solution should first run the examples and print the answers to the console.

We then retry 3 times and debug.

pnpm exec tsx src/solve.ts \
  --day 05.1 \
  --model gpt-4 \
  --solve-prompt solve/noTemplateNoProblemExamples/3 \
  --pre-solve-prompts chainOfThought/5 \
  --pre-solve-prompts examples/2 \
  --solve-failed-prompt failedSolve/5 \
  --solve-failed-attempts 3

This costs ~$0.50 per run and didn't particularly help.

Conclusions

GPT-4 was able to solve 6 of the first 10 days of Advent of Code 2022. I found this quite surprising (I was hoping for maybe 3-4). It's definitely a significant step up from ChatGPT.

It solves the problems in a completely generic way, with no real cheating or hacks. I spent a weekend on this project and barely scratched the surface of the prompt engineering that could be done here, and I got 6/10.

Chain-of-thought approaches seem to give the best results, but GPT is not particularly good at debugging / correcting its own mistakes (there is a dedicated editing API, which might do better in the future though).

$0.25 to solve a reasonably complex (albeit quite narrow) problem in 3 minutes is already very impressive.

It feels inevitable that costs will come down and speeds will increase with the march of technology. If it continues to advance every couple of years with performance bumps similar to the one from ChatGPT to GPT-4, it's hard to imagine what the future will look like by 2030.

You can find all the code for this project on GitHub: github.com/richardgill/advent-of-code-gpt

Future improvements

Multiple strategies

I noticed you can prompt the model to give you multiple approaches to solve the problem. Sometimes these ideas look quite different from one another.

Prompt:

Please explain 3 different approaches you could use to solve the problem. For each approach explain how the solution works and break down the small steps involved.

GPT replies:

Approach 1: Sliding Window

  1. Initialize a variable window to store the most recent four characters.
  2. Iterate through the datastream buffer.
  3. For each character, add it to the window and remove the oldest character if the window size is greater than 4.
  4. Check if all characters in the window are unique. If they are, return the current index + 1 (since the index is 0-based).
  5. If the loop ends without finding a marker, return -1 (or an appropriate error message).

This approach works by maintaining a "sliding window" of the most recent four characters. It checks if the characters in the window are unique, and if so, returns the position of the marker. The time complexity of this approach is O(n), where n is the length of the datastream buffer.

Approach 2: Using a Set

  1. Initialize a variable unique_chars as an empty set.
  2. Iterate through the datastream buffer.
  3. For each character, add it to the unique_chars set.
  4. If the size of unique_chars is 4, return the current index + 1 (since the index is 0-based).
  5. If the size of unique_chars is greater than 4, remove the oldest character from the set and continue the loop.
  6. If the loop ends without finding a marker, return -1 (or an appropriate error message).

This approach works by using a set to store the unique characters. It checks if the size of the set is 4, and if so, returns the position of the marker. The time complexity of this approach is also O(n), where n is the length of the datastream buffer.

Approach 3: Brute Force ...

I think doing this and then forking to pursue each of the different approaches might solve some of the days it struggles with.

Increase the temperature

I set temperature to 0, which means there is very little randomness in the responses the model gives. Turning up the temperature would give different answers, which might get lucky and solve a day (for more $!).
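For context, temperature is just a parameter on the chat completion request. A sketch using the openai npm package's v3 API (the model and default value here are illustrative):

import { Configuration, OpenAIApi } from 'openai'

const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }))

const askGpt = async (prompt: string, temperature = 0.7) => {
  const completion = await openai.createChatCompletion({
    model: 'gpt-4',
    temperature, // 0 = near-deterministic; higher = more varied (and luckier?) answers
    messages: [{ role: 'user', content: prompt }],
  })
  return completion.data.choices[0].message?.content ?? ''
}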

Appendix: Observations / notes

  • The model is quite hard to steer towards giving output in specific formats.

  • I tried using "system prompts": e.g. "You are Advent of Code GPT who helps solve Advent of Code programming challenges".

    • The default system prompt seems very helpful and this didn't have much of an impact for me.
  • The model does not appear to struggle to digest and understand the very long, Santa-and-Elf-based problems.

    • This is likely because the models are very comfortable with long text inputs.
  • When I implemented debugging, the code would print many console.logs, which would exceed the token limit (and cost $), so I had to truncate the output (a sketch follows this list).

  • My implementation solves Part 2 questions from scratch. I pass in the solution to Part 1 in the problem description.

    • This isn't really how humans would do it - but the model did just fine.
  • I didn't use OpenAI's dedicated code-based APIs (Codex); GPT-4 feels like it might be their strongest model for coding at the moment.
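The output truncation mentioned above can be as simple as this sketch (MAX_CHARS is an assumed cutoff, roughly matching the 1000-character threshold in the debug prompt):

const MAX_CHARS = 1000

const truncateOutput = (stdout: string): string =>
  stdout.length <= MAX_CHARS
    ? stdout
    : stdout.slice(0, MAX_CHARS) + `\n... (truncated ${stdout.length - MAX_CHARS} chars)`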

Intuitions / tips

  • Show, don't ask.
    • I think console.log('\n', answer) is better than "print the answer on its own line".
  • Don't ask for too much at once.
    • It feels like prompting with "do X, and Y and also Z" causes the model to latch on to only 1 or 2 of the 3.
    • I think splitting up your asks into multiple prompt messages is an ~ok strategy to solve this.
  • If you leave something in an earlier prompt, it will continue to influence future responses.
    • One strategy I tried was asking it to summarize the problem into a shorter problem, which didn't appear to help. To use this strategy, you'd need to ask for the summary and then start a new chat with only the summary, so the model has no knowledge of the original problem statement.

Feel free to email me at richard [at] rgill [dot] co [dot] uk if you have any questions or comments.