A Practical Guide to Reducing LLM Hallucinations with Sandboxed Code Interpreter
December 21, 2024


Most LLMs and SLMs are not designed for computation (I am not talking about OpenAI's o1 or o3 models here). Imagine the following conversation:

  • company: Today is Wednesday; you have 24 hours to return your package for delivery.
  • client: Okay, let’s do it on Tuesday.

Are you sure the next AI response will be correct? As a human, you understand that next Tuesday is six days away and that 24 hours is just one day. However, most LLMs cannot handle this logic reliably; their responses are uncertain.

This problem becomes more serious as the context grows. With a conversation history of 30 rules and 30 messages, the AI loses focus and becomes prone to errors.
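To make the gap concrete, here is a minimal TypeScript sketch of the deterministic check that plain code handles trivially and that a token-predicting model often gets wrong (the specific Wednesday date is assumed purely for illustration):

// The package must be returned within 24 hours; the client proposes "next Tuesday".
const today = new Date(2024, 11, 18); // a Wednesday, chosen for illustration
const returnWindowHours = 24;

// getDay(): Sunday = 0, ..., Tuesday = 2; "|| 7" maps "today" to "a week from now"
const daysUntilTuesday = (2 - today.getDay() + 7) % 7 || 7;

console.log(`Next Tuesday is ${daysUntilTuesday} days away`);                 // 6
console.log(daysUntilTuesday * 24 <= returnWindowHours ? "OK" : "Too late");  // "Too late"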


Common use cases

  • You are developing an AI scheduling chatbot or AI agent for your company.
  • The company has scheduling rules that are updated frequently.
  • Before scheduling, the chatbot must validate the parameters entered by the customer.
  • If validation fails, the chatbot must notify the customer.


What can we do?

Combine traditional code execution with an LLM. This idea is not new, but it is still underutilized:

  • OpenAI integrates this functionality into its Assistants API, but not into the Completions API.
  • Google recently launched a code execution feature in Gemini 2.0 Flash.


Our solution technology stack


Code Interpreter Sandbox

To safely run generated code, the most popular cloud code interpreters are E2B, Google, and OpenAI, as I mentioned earlier.

However, I was looking for an open-source, self-hosted solution that offers flexibility and cost-effectiveness, and I found two good options.

I chose Piston because of its ease of deployment.



Piston installation

It took me a while to figure out how to add a Python execution environment to Piston.


0. Enable cgroup v2

For Windows WSL, this article was very helpful.


1. Run the container

docker run --privileged -p 2000:2000 -v d:\piston:'/piston' --name piston_api ghcr.io/engineer-man/piston


2. Clone the Piston repository

git clone https://github.com/engineer-man/piston


3. Add Python support

Run the following command:

node cli/index.js ppman install python

By default, this command uses the container API running at localhost:2000 to install Python.
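To confirm the runtime is actually available, you can query Piston's public REST API (the /api/v2/runtimes endpoint below comes from the project's documentation; adjust the host and port if you mapped the container differently):

// List all runtimes known to the container and check that Python is among them.
const response = await fetch("http://localhost:2000/api/v2/runtimes");
const runtimes: { language: string; version: string }[] = await response.json();
console.log(runtimes.filter((runtime) => runtime.language === "python"));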


Sample code execution

Use the Piston Node.js client:

import piston from "piston-client";

// Point the client at the locally running Piston container.
const codeInterpreter = piston({ server: "http://localhost:2000" });

// Run a Python snippet inside the sandbox.
const result = await codeInterpreter.execute('python', 'print("Hello World!")');

console.log(result);



AI agent implementation

Source code on GitHub

We will use some advanced technologies:

  • Graph and subgraph architecture (LangGraph)
  • Parallel node execution
  • Qdrant for storage
  • Observability through LangSmith
  • GPT-4o-mini as a cost-effective LLM

For a detailed overview of the process, see the LangSmith trace:
https://smith.langchain.com/public/b3a64491-b4e1-423d-9802-06fcf79339d2/r


Step 1: Extract date-time related scheduling parameters from user input

Examples: “tomorrow”, “last Friday”, “2 hours from now”, “noon”.
We use a code interpreter to ensure reliable extraction, since an LLM may fail even when the current datetime is provided in its context.

Example of the Python code generation prompt:

Your task is to transform natural language text into Python code that extracts datetime-related scheduling parameters from user input.  

## Instructions:  
- You are allowed to use only the "datetime" and "calendar" libraries.  
- You can define additional private helper methods to improve code readability and modularize validation logic.  
- Do not include any import statements in the output.  
- Assume all input timestamps are provided in the GMT+8 timezone. Adjust calculations accordingly.  
- The output should be a single method definition with the following characteristics:  
  - Method name: \`getCustomerSchedulingParameters\`  
  - Arguments: None  
  - Return: A JSON object with the keys:  
    - \`appointment_date\`: The day of the month (integer or \`None\`).  
    - \`appointment_month\`: The month of the year (integer or \`None\`).  
    - \`appointment_year\`: The year (integer or \`None\`).  
    - \`appointment_time_hour\`: The hour of the day in 24-hour format (integer or \`None\`).  
    - \`appointment_time_minute\`: The minute of the hour (integer or \`None\`).  
    - \`duration_hours\`: The duration of the appointment in hours (float or \`None\`).  
    - \`frequency\`: The recurrence of the appointment. Can be \`"Adhoc"\`, \`"Daily"\`, \`"Weekly"\`, or \`"Monthly"\` (string or \`None\`).  

- If a specific value is not found in the text, return \`None\` for that field.  
- Focus only on extracting values explicitly mentioned in the input text; do not make assumptions.  
- Do not include print statements or logging in the output.  

## Example:  

### Input:  
"I want to book an appointment for next Monday at 2pm for 2.5 hours."  

### Output:  
def getCustomerSchedulingParameters():  
    """Extracts and returns scheduling parameters from user input in GMT+8 timezone.  

    Returns:  
        A JSON object with the required scheduling parameters.  
    """  
    def _get_next_monday():  
        """Helper function to calculate the date of the next Monday."""  
        current_time = datetime.datetime.utcnow() + datetime.timedelta(hours=8)  # Adjust to GMT+8  
        today = current_time.date()  
        days_until_monday = (7 - today.weekday() + 0) % 7  # Monday is 0  
        return today + datetime.timedelta(days=days_until_monday)  

    next_monday = _get_next_monday()  
    return {  
        "appointment_date": next_monday.day,  
        "appointment_month": next_monday.month,  
        "appointment_year": next_monday.year,  
        "appointment_time_hour": 14,  
        "appointment_time_minute": 0,  
        "duration_hours": 2.5,  
        "frequency": "Adhoc"  
    }

### Notes:
Ensure the output is plain Python code without any formatting or additional explanations.
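For completeness, this is roughly how the prompt above can be sent to GPT-4o-mini with LangChain's ChatOpenAI client; the helper name and prompt wiring are illustrative assumptions, not the exact node from the repository:

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const extractionPrompt = "…"; // the full code generation prompt shown above

// Hypothetical helper: returns the generated getCustomerSchedulingParameters definition as plain text.
async function generateExtractionMethod(userMessage: string): Promise<string> {
  const response = await llm.invoke([
    ["system", extractionPrompt],
    ["human", userMessage],
  ]);
  return response.content as string;
}

const pythonParametersExtractionMethod = await generateExtractionMethod(
  "I want to book an appointment for next Monday at 2pm for 2.5 hours."
);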


Step 2: Get rules from storage

The retrieved rules are then converted into Python code for validation, as sketched below.
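A rough sketch of this step, assuming a Qdrant collection called scheduling_rules with a rule_text payload field (both names are made up here; the actual repository may store and retrieve rules differently):

import { QdrantClient } from "@qdrant/js-client-rest";
import { ChatOpenAI } from "@langchain/openai";

const qdrant = new QdrantClient({ url: "http://localhost:6333" });
const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

async function generateValidationMethod(): Promise<string> {
  // Pull the current scheduling rules from storage.
  const { points } = await qdrant.scroll("scheduling_rules", { limit: 50, with_payload: true });
  const rules = points.map((point) => point.payload?.rule_text).filter(Boolean).join("\n- ");

  // Ask the LLM to turn the rules into the Python method the Step 3 wrapper expects.
  const response = await llm.invoke([
    ["system", "Transform the scheduling rules below into a single Python method named " +
      "validateCustomerSchedulingParameters(appointment_year, appointment_month, appointment_date, " +
      "appointment_time_hour, appointment_time_minute, duration_hours, frequency) that returns a list " +
      "of human-readable validation error strings. Use only the datetime and calendar libraries."],
    ["human", `- ${rules}`],
  ]);
  return response.content as string;
}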


Step 3: Execute the generated code in the sandbox:

const pythonCodeToInvoke = `
import sys
import datetime
import calendar
import json

${state.pythonValidationMethod}

${state.pythonParametersExtractionMethod}

parameters = getCustomerSchedulingParameters()

validation_errors = validateCustomerSchedulingParameters(parameters["appointment_year"], parameters["appointment_month"], parameters["appointment_date"], parameters["appointment_time_hour"], parameters["appointment_time_minute"], parameters["duration_hours"], parameters["frequency"])

print(json.dumps({"validation_errors": validation_errors}))`;

    // Wrap the sandbox call with LangSmith's traceable() so the execution appears in the trace.
    const traceableCodeInterpreterFunction = traceable((pythonCodeToInvoke: string) => codeInterpreter.execute('python', pythonCodeToInvoke, { args: [] }));
    const result = await traceableCodeInterpreterFunction(pythonCodeToInvoke);
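Continuing from the snippet above, the validation errors can then be read back from the sandbox output. The result shape (run.stdout) follows what the Piston API returns, but treat it as an assumption and add error handling for failed executions:

// Parse the JSON line printed by the Python wrapper.
const stdout: string = result.run?.stdout ?? "";
const { validation_errors } = JSON.parse(stdout);

if (validation_errors.length > 0) {
  // Validation failed: report the issues back to the customer instead of booking.
  console.log("Validation failed:", validation_errors);
}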

Source code on GitHub



Potential improvements

  • Implement an iterative loop so the LLM can dynamically debug and fix failing Python code executions (see the sketch after this list).
  • Human-in-the-loop review of the generated validation method code.
  • Cache the generated code.
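A minimal sketch of the first idea, reusing the codeInterpreter client from earlier and a hypothetical regeneratePythonCode(code, error) helper that asks the LLM for a corrected version (both the helper and the retry policy are illustrative):

async function executeWithRetries(pythonCode: string, maxAttempts = 3): Promise<string> {
  let code = pythonCode;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await codeInterpreter.execute("python", code, { args: [] });
    const stderr: string = result.run?.stderr ?? "";
    if (!stderr) return result.run?.stdout ?? "";
    // Feed the runtime error back to the LLM and ask it to fix its own code.
    code = await regeneratePythonCode(code, stderr);
  }
  throw new Error("Generated code still failing after all retry attempts");
}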


Final thoughts

Sandboxed code execution and token-based LLMs are highly complementary technologies that unlock new flexibility. There is a bright future for this collaborative approach, exemplified by AWS's recent “Bedrock Automated Reasoning,” which appears to offer a similar capability across its enterprise ecosystem. Google and Microsoft will likely show us something similar soon.
