# Introduction

After writing a blog post where I experimented with [[LLMs to Write Fuzzers]] I kept thinking about how to improve my approach and whether I could scale it to fuzz entire repos with minimal interaction on my end. Writing fuzzers requires understanding the code, identifying target functions, and writing a specific test harness for each one. My hypothesis is that GPT can automate the writing, fixing, and triaging of fuzz tests because we are exclusively dealing with short code fragments.

This post explains my project approach, walks through the fuzzing and implementation process in detail, and introduces my newly created LLMFuzz tool. It will only show code snippets, as the project is still being actively developed and refactored and the tool will change. I hope you enjoy and learn a thing or two!

## Background Info

Large Language Models (LLMs) like OpenAI's GPT models are a type of artificial intelligence that uses deep learning to analyze and generate human language, code, and images. ChatGPT took the development world by storm in part because the underlying models (GPT-3.5 and GPT-4) are brilliant at writing and analyzing source code.

LLMs are great tools but have real limitations: they are good at code but bad at facts. In my experience they have a tendency to hallucinate, inventing methods and API endpoints, and the 4,097 or 8,192 token limits make it a challenge to fit the full context for complex code. Even with those limitations in mind, their code proficiency makes them a force multiplier for domain experts writing tools. I liken them to an eager intern or junior developer: they will happily churn out code and try anything you tell them, even if they don't fully understand why they are performing a task. The onus is on you to verify correctness and validate the code they output.

Fuzzing is a security testing technique in which a tool generates random inputs to identify potential vulnerabilities or, more generally, to discover non-standard behavior in software. Beyond increasing resilience, bugs that cause unusual or unexpected behavior can have security implications exploitable by malicious users and should be eliminated from software. I believe fuzzing should be integrated into the development lifecycle of every software program, much like unit tests.

[Atheris](https://github.com/google/atheris) is a Python-based, coverage-guided fuzzer. I chose it because Python is the language OpenAI's GPT models understand best, it is a relatively new project with limited training data (which lets us test the limits of GPT), and there is a large number of open source Python libraries without fuzz tests, making them easy targets for measuring efficacy.

### Side Note on Python Bug Types

Historically, fuzzers were used to find crashes in memory-unsafe software, which attackers could exploit with sophisticated techniques such as heap overflows to gain code execution. While fuzz testing remains crucial for that class of software, it has also found widespread application in dynamic languages like Python. When fuzzing a memory-safe language like Python, the fuzzer lets us discover unhandled exceptions, logic bugs, security issues stemming from those logic bugs, and denial-of-service conditions caused by hangs and excessive memory usage.
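To make the mechanics concrete, here is a minimal, generic Atheris harness. This is a sketch of my own rather than code generated by the tool; it uses `json.loads` as a stand-in target and lets any exception other than the expected `ValueError` surface as a finding:

```python
# Generic Atheris harness sketch (not generated by LLMFuzz): shape the raw
# fuzzer bytes into a string and let unexpected exceptions surface as findings.
import sys

import atheris

with atheris.instrument_imports():
    import json  # stand-in target; LLMFuzz targets arbitrary library functions

def TestOneInput(data: bytes) -> None:
    fdp = atheris.FuzzedDataProvider(data)
    text = fdp.ConsumeUnicodeNoSurrogates(1024)
    try:
        json.loads(text)
    except ValueError:
        pass  # expected parse errors; anything else is reported as a crash

atheris.Setup(sys.argv, TestOneInput)
atheris.Fuzz()
```

Running a harness like this starts the coverage-guided loop and prints the libFuzzer-style status lines shown later in this post.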
My experience with fuzzing Python code has revealed a diverse range of bugs: out-of-bounds access, AttributeErrors from accessing attributes on None, calling max() on empty lists, improper error handling leading to unexpected crashes, and erroneous refactoring resulting in calls to non-existent functions. Although unexpected exceptions typically amount to denial of service (by crashing a program), they often expose more severe underlying bugs in libraries. Another one of the most effective use cases for the Atheris fuzzer is creating differential fuzzers. Differential fuzzers target discrepancies in the behavior of two libraries intended to perform identical tasks, revealing potential vulnerabilities and weaknesses in their implementations.

## Typical Fuzzing Engagement Stages

This research project follows the same methodology as a typical fuzzing engagement, except that LLMs are integrated wherever possible. Part 1 will expand on these stages:

* **Reconnaissance** - The process of learning about the target codebase.
* **Creating Fuzz Tests** - Using the information gathered during the recon stage to write fuzzers for the targeted library functions.
* **Initial Fuzz Run** - Configuring, running, and analyzing our newly created fuzzers to see which ones run successfully, which provide good coverage, and which need to be fixed or refactored.
* **Fixing Broken Fuzzers** - Analyzing the initial fuzz test results to fix the non-running tests and improve coverage of fuzzers that are not finding new paths.
* **Extended Fuzz Analysis** - Running all working fuzzers for longer periods of time to capture detailed metrics.

The remaining stages will be covered in part 2:

* Triaging Crashes - Reviewing all crashes and exceptions to evaluate security impact.
* Reporting Bugs - Sharing findings with the development team and discussing impact.
* Analyzing Fuzzer Results - Comparing the individual fuzzing results, coverage statistics, and vulnerability discovery rates to measure efficacy.

# Reconnaissance

Reconnaissance, often abbreviated as recon, is the initial information-gathering stage of the offensive security process. The goal is to understand the repo's structure and core functionality and to identify critical code paths and potential vulnerability sinks. In a typical workflow this step is performed manually: downloading the code into an IDE, exploring the codebase in depth, and searching for potentially dangerous functions, such as those responsible for parsing, handling user input, or exhibiting high complexity. I also look for any existing fuzzers and examine the tests to understand how to isolate and exercise specific functionality without running the entire program. A thorough recon process sets the foundation for a more effective fuzzing campaign, as it gives the researcher valuable insight into the codebase and its attack surface.

Usually, security testers are limited by time constraints and their capacity to understand the codebase and identify critical code paths. Since GPT is excellent at understanding code and is not time constrained, we simply extract all functions. Keeping in mind our goal of a fully automated fuzzing chain that scales, I want to provide a repo and let the tool do the rest.
Our llm-fuzz tool performs this stage in three steps:

1) [[#Download the Code]]
2) [[#Search for Existing Fuzzers]] in the current repo and oss-fuzz
3) [[#Get Code for all functions]] and save it in the DB

To use our code we must first create a LLMFuzzRecon object, setting the DB used to store data, the path where we download the repos and create the generated files, the target libraries, and the repo language. The object initialization stores these values and calls several functions which we will explain below.

```python
llmfuzz_recon = LLMFuzzRecon("fuzz-forest.db", "working_directory/", target_libraries_dict, 'python')

class LLMFuzzRecon:
    def __init__(self, sqlitedb, repo_path, target_libraries_dict, language):
        self.db = Database(sqlitedb)
        self.sqlitedb = sqlitedb
        self.repo_path = repo_path
        self.language = language
        # functions called during initialization
        self.save_library_info(target_libraries_dict)
        self.download_oss_fuzz_repo()
        self.download_github_repos(target_libraries_dict)
        self.get_fuzz_files(target_libraries_dict)
        self.radon_analysis(target_libraries_dict)
        self.get_code_all_functions(target_libraries_dict)
        self.clean_functions_in_DB(target_libraries_dict)
```

The arguments should be self-explanatory except for **target_libraries_dict**, which is a Python dictionary of any length mapping a 'library_name' to its 'github_url', like:

```python
target_libraries_dict = {'requests': 'https://github.com/psf/requests'}
```

The first init function, `save_library_info()`, creates a SQLite DB with entries for the **library_name** and **github_url** if they do not exist.

### Download the Code

Next in the code download process is `download_oss_fuzz_repo()`, which pulls the oss-fuzz repo from GitHub. [OSS-Fuzz](https://github.com/google/oss-fuzz) is Google's open source fuzzing infrastructure and contains a large number of fuzz tests for open source libraries in many languages. The `download_github_repos()` function runs next and downloads the code from the previously supplied **github_url** into the **working_directory**.

### Search for Existing Fuzzers

The `get_fuzz_files()` function searches the library and the oss-fuzz repo for any existing fuzz files and saves them in the database with a `type=fuzzer` column. Any fuzz test cases found are extracted and saved so they can be added to the prompt as context in later steps.

### Identify Target Functions

Identifying potential target functions is one of the most time-consuming parts of fuzzing, and of code review in general. A reviewer has to use their security-specific domain knowledge to search the codebase for functions with high security potential. I spent a couple of days trying several different approaches to find functions, including using the inspect library and walking the entire codebase, but settled on the [radon](https://github.com/rubik/radon) tool. Radon measures the cyclomatic complexity of Python codebases and grades every function. Cyclomatic complexity is a quantitative measure of the number of linearly independent paths through a function, computed from the program's control flow graph, and is used to indicate how complex the code is. This works great for Python code, but I realize I will have to change my approach for other languages. We'll figure that out when we get there.
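As a rough illustration of what the `radon_analysis()` step relies on, the sketch below uses radon's `cc_visit()` and `cc_rank()` API to enumerate and grade every function in a single source file. The file path is hypothetical and the real implementation in llm-fuzz may differ:

```python
# Illustrative use of radon's programmatic API (cc_visit, cc_rank) to enumerate
# and grade functions in one file; the path is hypothetical and llm-fuzz's
# radon_analysis() step may work differently.
from radon.complexity import cc_visit, cc_rank

with open("working_directory/requests/src/requests/models.py") as f:  # hypothetical path
    source = f.read()

for block in cc_visit(source):
    # each block exposes .name, .lineno, .endline and a numeric .complexity
    print(block.name, block.lineno, block.endline, cc_rank(block.complexity))
```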
For now, the radon tool finds every function in the code and returns the file_name, line numbers, and a complexity score from 'A' (a simple block with low complexity) to 'F' ('very high complexity - error-prone'). We then save those values in the database.

### Get Code for all functions

The previous function identified and rated all functions in the repo. `get_code_all_functions()` iterates through every function entry in the database, finds its code, and stores it in the contents column. This is the database table we have filled up by this point:

```python
class LibraryFile(Base):
    __tablename__ = "library_files"

    id = Column(Integer, primary_key=True)
    library_name = Column(String)
    file_name = Column(String)
    function_name = Column(String)
    contents = Column(Text)            # source code
    fuzz_test = Column(Boolean)
    language = Column(String)
    complexity_score = Column(String)  # radon rating
    type = Column(String)              # fuzzer, test, source, etc.
    file_path = Column(String)
    file_line_start = Column(Integer)
    file_line_end = Column(Integer)
```

The `clean_functions_in_DB()` call at the end of the class initialization removes functions that gave me trouble in later tests and would not be fuzzed anyway: things like `__init__`, `main()`, and any `test()` functions.

# Creating Fuzz Tests

Once I have a solid grasp of the code and have determined which components to test (e.g., the core processing engine versus utility code), I usually begin writing individual test functions and conducting manual triage. This step is typically slow when done by a human, who must carefully consider and write each harness, then run it to verify it works. LLMs have no such constraints, allowing us to target every function in the code without limitation or delay. This is the real appeal of LLMs: they can produce content without limits; possibly bad for society, but good for us.

Our llm-fuzz tool works by initializing an LLMFuzz object and setting the database and language: `llmfuzz = LLMFuzz(sqlitedb, 'python')`

Visit the link to view [the complete prompt](https://github.com/tree-wizard/fuzz-forest/blob/main/prompts/base-atheris-prompt.py). It took quite a bit of work and a lot of trial and error to get it right and return decent results. We start with a simple explanation of how Atheris works and a generic fuzz test example. I found that I had to include some info about the data API or the model would not use it, I suspect because it is a new feature in a lesser-known fuzzing framework compared to something like AFL or libFuzzer. Next we show a complex fuzz example taken from the Atheris repo that uses mutators and introduces exception handling. We end the prompt with a specific directive to make sure all the code gets instrumented; in my early tests GPT-4 would either omit the instrumentation calls entirely or place them in non-working locations, so I added strong emphasis.

Another important thing to keep in mind is token length. Tokens can be thought of as pieces of words; 100 tokens is about 75 words. This means we have to be cognizant of how long our prompt is, since we might be returning a lot of code. It depends on the model, but GPT-3.5 and GPT-4 requests can use up to 4,097 and 8,192 tokens respectively, shared between prompt and completion. If your prompt is 4,000 tokens, your completion can be at most 97 tokens.
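The tool does not necessarily count tokens this way, but as an illustration of the budget math above, a helper built on the `tiktoken` library (an assumption on my part, not something the post prescribes) could check that a prompt leaves room for the completion:

```python
# Illustration of the token budget described above; tiktoken is an assumption,
# the post does not say how (or whether) LLMFuzz counts tokens.
import tiktoken

CONTEXT_LIMIT = 8192       # GPT-4 request limit, shared by prompt and completion
COMPLETION_BUDGET = 1550   # completion room; the OpenAI calls later in the post use max_tokens=1550

def prompt_fits(prompt: str, model: str = "gpt-4") -> bool:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(prompt)) + COMPLETION_BUDGET <= CONTEXT_LIMIT
```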
Before we continue, I also want to show two helper functions I implemented during testing to filter for the functions with the highest potential for bugs.

```python
radon_score = ['C', 'D', 'E', 'F']
radon_funcs = llmfuzz.get_radon_functions_from_db(library_name, radon_score)

parse_functions = llmfuzz.get_functions_that_contain_string(library_name, 'parse')
```

`get_radon_functions_from_db()` returns all functions from the DB that have one of the specified scores. As mentioned earlier, functions with a radon score of 'A' are simple and non-complex, so rather than waste time, money on OpenAI calls, and CPU cycles on those, we can pull the ones with a higher probability of containing errors. `get_functions_that_contain_string()` returns all functions that contain a given string. In my experience `parse` functions expose some of the largest attack surface, so I used this a lot during testing to get good candidates.

Whether we use radon scores or function names as the filter, we then call `generate_fuzz_tests()` to pull functions from the DB and ask the LLM to create a fuzzer for each one:

```python
llmfuzz.generate_fuzz_tests(library_name, radon_funcs)
llmfuzz.generate_fuzz_tests(library_name, parse_functions)
llmfuzz.generate_fuzz_tests(library_name)  # all functions
```

`generate_fuzz_tests()` pulls the data from the DB and builds the full prompt by combining our previously discussed base Atheris prompt (**base_template**), the **directive** string (which names the library and function and instructs the model to return only code), and finally the **function_code** for extra context, then sends it to the LLM, OpenAI in this case:

```python
directive = f"Return ONLY valid and properly formatted Python code. Do NOT include any comments, explanations or notes. Import the {library_name} and write an atheris fuzz test for the {function_name} function in {library_name}:\n"

prompt = base_template + directive + "This is the source code for " + function_name + ":" + function_code

# the prompt gets sent to this function
def generate_fuzz_code(self, prompt):
    max_retries = 5
    retry_delay = 5
    for _ in range(max_retries):
        try:
            response = openai.ChatCompletion.create(
                model='gpt-4',
                messages=[{"role": "system", "content": prompt}],
                max_tokens=1550,
                temperature=0.6,
            )
            return response["choices"][0]["message"]["content"]
        except openai.error.RateLimitError as e:
            print(f"Rate limit error encountered: {e}. Retrying")
            time.sleep(retry_delay)
        except openai.error.APIConnectionError as e:
            print(f"Connection error encountered: {e}. Retrying")
            time.sleep(retry_delay)
    print("Failed to generate fuzz code after multiple retries.")
    return None
```

The generated test is saved in the database with `fuzz_test=True` as we iterate through every function in the list.

# Initial Testing

This step runs each fuzzer for two cycles to verify that the code executes without errors. I initially spent a lot of time trying to write a custom LLM agent that would create the fuzz test, run it, fix the code if it did not work, and triage, all in one step. But after taking a few days away I decided to take the fail-fast approach from my Erlang coding days: create the test, run it, and if it fails, don't waste time trying to fix it yet. I mark it as `runs=False` in the database and move on to the next one.
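As a rough sketch of what "run it for two cycles" can look like in practice (the tool's actual `run_atheris_fuzzer()` helper, described in the next section, may be implemented differently), the generated harness can be written to disk, launched in a subprocess with libFuzzer's `-runs=2` flag, and checked for the 'Done 2 runs' marker:

```python
# Sketch only: write the generated harness to disk, run it with libFuzzer's
# -runs=2 flag, and look for the "Done 2 runs" marker described in the post.
# The real run_atheris_fuzzer() helper may differ.
import subprocess
import sys
import tempfile

def runs_for_two_cycles(harness_code: str, timeout: int = 120) -> tuple[bool, str]:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(harness_code)
        harness_path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, harness_path, "-runs=2"],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out before completing two runs"
    output = proc.stdout + proc.stderr
    return "Done 2 runs" in output, output
```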
This fail-fast approach is better: with such a large number of tests to run through, we should focus only on the ones that work and use a separate process to fix the ones that do not. I return to fixing and triaging in later steps, but this perspective allowed me to move forward and really make use of LLMs' mass-generation capabilities.

The current approach creates an LLMFuzz object and calls `initial_fuzz_analysis()` on the target library. `initial_fuzz_analysis()` pulls every previously generated `fuzz_test=True` file from the DB and sends it to `run_atheris_fuzzer()`, a simple subprocess helper that takes the code string, attempts to run it for two cycles, and returns the entire output, which we save. If the output contains 'Done 2 runs', the test instrumented and ran correctly for at least two cycles. If it cannot run for two cycles or raises an exception, we set `runs=False` or `exception=True` and save it.

```python
llmfuzz = LLMFuzz(sqlitedb, 'python')
llmfuzz.initial_fuzz_analysis(library_name)

def initial_fuzz_analysis(self, lib):
    fuzz_functions = self.get_lib_fuzz_tests_from_db(self.sqlitedb, lib)
    for function in fuzz_functions:
        output = run_atheris_fuzzer(function.contents)
        if 'Done 2 runs' in output:
            self.update_fuzz_test_in_db(function.id, runs=True, run_output=output)
        elif 'Exception' in output:
            self.update_fuzz_test_in_db(function.id, runs=False, run_output=output, exception=True)
        else:
            self.update_fuzz_test_in_db(function.id, runs=False, run_output=output)
```

The purpose of `initial_fuzz_analysis()` is to identify all the fuzzers we can run without errors and tag them for either fixing or extended testing.

# Fixing Broken Fuzzers

Our database now holds a long list of generated fuzz functions with details like the code, run output, run status, and more. For clarity, this is our GeneratedFile table:

```python
class GeneratedFile(Base):
    __tablename__ = "generated_files"

    id = Column(Integer, primary_key=True)
    library_name = Column(String)
    file_name = Column(String)
    function_name = Column(String)
    contents = Column(Text)
    runs = Column(Boolean)
    fuzz_test = Column(Boolean)
    type = Column(String)
    coverage = Column(Integer)
    cycles = Column(Integer)
    run_output = Column(Text)
    tokens = Column(Integer)
    crash = Column(Boolean)
    exception = Column(Boolean)
    refactored = Column(Boolean)
```

Non-working fuzz tests are those with `runs=False`, set during the `initial_fuzz_analysis()` call. To extract and fix them we use the same llmfuzz object to call `fix_fuzz_test_code()` on the library, which sends every `runs=False` function's source code and output to `fix_fuzz_test()`. The `fix_fuzz_test_code()` algorithm is simple:

* The **function_code** and **output** are sent to `fix_code()`, which builds a prompt instructing OpenAI to review and fix the code, returning an **updated_code** variable.
* The **updated_code** is sent to the same `run_atheris_fuzzer()` to try to run for two cycles.
* If it works we save the **updated_code** and **new_output** and set **runs=True** in the DB.
* Otherwise we update the output variable and go through the loop again.

```python
llmfuzz.fix_fuzz_test_code(library_name)

def fix_code(self, code: str, output: str) -> str:
    prompt = (
        f"Please rewrite the following code to fix any issues or errors:\n\n"
        f"---Code Starts---\n"
        f"{code}\n"
        f"---Code Ends---\n\n"
        f"Output: {output}\n\n"
        f"IMPORTANT: Return only valid and properly formatted Python code. Do NOT "
        f"include any comments, explanations or notes on the changes made."
    )
    response = openai.ChatCompletion.create(
        model='gpt-4',
        messages=[{"role": "system", "content": prompt}],
        max_tokens=1550,
        temperature=0.6,
    )
    return response["choices"][0]["message"]["content"]

def fix_fuzz_test(self, function_code, output) -> Tuple[str, str, bool]:
    successful_run = 'Done 2 runs'
    max_attempts = 5
    updated_code = function_code
    for attempt in range(max_attempts):
        updated_code = self.fix_code(updated_code, output)
        new_output = run_atheris_fuzzer(updated_code)
        if successful_run in new_output:
            print('fixed')
            return updated_code, new_output, True
        else:
            output = new_output
            time.sleep(1)
    return updated_code, output, False
```

Note how strict I am with the GPT output: "IMPORTANT: Return only valid and properly formatted Python code. Do NOT include any comments, explanations or notes on the changes made." I added this because the model kept returning comments like "Ok it should work now, I imported xxx" or "Sorry for the confusion, I meant to update xxx." The directive helps but doesn't always work; about one response in 50 still adds commentary about the changes that breaks the code. Still, this prompt loop provides great results and fixes 75% of all functions within the five-attempt limit. The tests that are not fixed usually stem from imports of libraries or functions that do not exist, proving that even with in-depth prompts, the function source code, and a popular open source library under test, generative AI has a tendency to hallucinate.

# Extended Fuzz Analysis

In this stage we run our fuzzers for several hours to explore further, find areas to optimize, and hopefully uncover vulnerabilities. The `extended_fuzz_analysis(library_name, time_to_run=20000)` call pulls every `runs=True, exception=False` fuzz test from the database and runs it for the specified time, after which we capture the output, parse the coverage metrics, and save the stats in the database.
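Extracting the coverage number can be as simple as a regex over the fuzzer's status lines (shown a little further below); this is a sketch of what a `parse_coverage()` helper might look like, not necessarily the tool's implementation:

```python
# Sketch of a parse_coverage() helper: grab the last "cov: N" value from the
# libFuzzer-style status lines; the tool's real implementation may differ.
import re

def parse_coverage(output: str) -> int:
    hits = re.findall(r"cov:\s*(\d+)", output)
    return int(hits[-1]) if hits else 0
```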
The extended analysis loop itself writes each generated test to disk and runs it with `-max_total_time`:

```python
def extended_fuzz_analysis(self, library_name, time_to_run=20000):
    fuzz_functions = self.get_lib_fuzz_tests_from_db(self.sqlitedb, library_name, runs=True, exception=False)
    for function in fuzz_functions:
        function_path = os.path.join(generated_files_path, function.function_name)
        os.makedirs(function_path, exist_ok=True)
        fuzzer_file_path = os.path.join(function_path, function.file_name)
        with open(fuzzer_file_path, 'w') as fuzzer_file:
            fuzzer_file.write(function.contents)

        command = f'python {fuzzer_file_path} -max_total_time={time_to_run}'
        timeout = time_to_run + 300  # add a 5 minute buffer to catch hangs
        output, crash = self.run_fuzzer(command, timeout)
        exception = 'exception' in output.lower()
        cov = self.parse_coverage(output)
        self.update_fuzz_test_in_db(function.id, run_output=output, coverage=cov, exception=exception, crash=crash)
```

For reference, this is what the Atheris output we parse looks like while running:

```
INFO: Instrumenting 3173 functions...
...
INFO: A corpus is not provided, starting from an empty corpus
#2 INITED cov: 47 ft: 47 corp: 1/1b exec/s: 0 rss: 51Mb
#14 NEW cov: 48 ft: 48 corp: 2/2b lim: 4 exec/s: 2 ChangeBinInt-ChangeByte-
#32 NEW cov: 49 ft: 49 corp: 3/4b lim: 4 exec/s: 3 ChangeBit-ShuffleBytes-InsertByte-
#36 NEW cov: 52 ft: 52 corp: 4/7b lim: 4 exec/s: 4 InsertByte-CrossOver-EraseBytes
#65 NEW cov: 71 ft: 75 corp: 5/9b lim: 4 exec/s: 4 ShuffleBytes-ChangeByte-
#88 NEW cov: 72 ft: 80 corp: 6/12b lim: 4 exec/s: 3 CrossOver-ChangeBit-InsertByte-
#89 NEW cov: 74 ft: 90 corp: 7/15b lim: 4 exec/s: 1 ShuffleBytes-
#108 NEW cov: 74 ft: 94 corp: 8/19b lim: 4 exec/s: 4 InsertByte-CopyPart-ChangeBit-
#110 NEW cov: 76 ft: 96 corp: 9/22b lim: 4 exec/s: 2 CopyPart-InsertByte-
#121 NEW cov: 80 ft: 100 corp: 10/25b lim: 4 exec/s: 1 ShuffleBytes-
```

The first number is the total number of runs, **cov** is the number of branches covered, **ft** is the number of paths hit, **corp** is short for corpus and shows the number of test cases kept and their total size, **exec/s** is executions per second, and the string at the end lists the mutations performed for that run.

The output of a crash looks like this:

```
#64 NEW cov: 94 ft: 109 corp: 1/19b lim: 4 exec/s: 3 rss: 67Mb
 === Uncaught Python exception: ===
ParamValidationError: Parameter validation failed:
Invalid length for parameter Key, value: 0, valid min length: 1
Traceback (most recent call last):
```

In both cases we save the entire output, the coverage stats, and whether there was a crash or exception. We use this information for triage and validation.

# Conclusion

Throughout this walk-through we demonstrated how LLMFuzz integrates with each stage of a typical fuzzing engagement. By automating reconnaissance, fuzz test creation, and code execution, LLMFuzz eliminates the most time-consuming manual aspects of fuzzing. The LLM-driven repair loop also ensures that non-running fuzzers get fixed, allowing us to scale to hundreds of functions.

The next post in this series will cover the triage and analysis steps. We will also go in depth on the results of running against popular open source libraries and give detailed generation statistics and the total cost to run. Thanks and see you soon.