A Practical Guide to Reducing LLM Hallucinations with Sandboxed Code Interpreter
December 21, 2024


Most LLMs and SLMs are not designed for computation (I am not talking about OpenAI's o1 or o3 models here). Imagine the following conversation:

  • company: Today is Wednesday; you have 24 hours to return your package for delivery.
  • client: Okay, let’s do it on Tuesday.

Are you sure the next AI response will be correct? As a human, you understand that next Tuesday is six days away and that 24 hours is just one day. However, most LLMs cannot handle this logic reliably; their responses are uncertain.

This problem becomes more serious as the context grows. With a conversation history of 30 rules and 30 messages, the AI loses focus and becomes prone to errors.
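To make the gap concrete, here is a minimal TypeScript sketch of the deterministic check that plain code handles trivially and that a token-predicting model often gets wrong (the specific Wednesday date is assumed purely for illustration):

// The package must be returned within 24 hours; the client proposes "next Tuesday".
const today = new Date(2024, 11, 18); // a Wednesday, chosen for illustration
const returnWindowHours = 24;

// getDay(): Sunday = 0, ..., Tuesday = 2; "|| 7" maps "today" to "a week from now"
const daysUntilTuesday = (2 - today.getDay() + 7) % 7 || 7;

console.log(`Next Tuesday is ${daysUntilTuesday} days away`);                 // 6
console.log(daysUntilTuesday * 24 <= returnWindowHours ? "OK" : "Too late");  // "Too late"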


Common use cases

  • You are developing an AI scheduling chatbot or AI agent for your company.
  • The company has scheduling rules that are updated frequently.
  • Before scheduling, the chatbot must validate the parameters entered by the customer.
  • If validation fails, the chatbot must notify the customer.


What can we do?

Combine traditional code execution with an LLM. This idea is not new, but it is still underutilized:

  • OpenAI integrates this functionality into its Assistants API, but not into the Completions API.
  • Google recently launched a code execution feature in Gemini 2.0 Flash.


Our solution technology stack


Code Interpreter Sandbox

To safely run generated code, the most popular cloud code interpreters are E2B, Google, and OpenAI, as I mentioned earlier.

However, I was looking for an open-source, self-hosted solution that offers flexibility and cost-effectiveness, and I found two good options.

I chose Piston because of its ease of deployment.



Piston installation

It took me a while to figure out how to add a Python execution environment to Piston.


0. Enable cgroup v2

For Windows WSL, this article was very helpful.


1. Run the container

docker run --privileged -p 2000:2000 -v d:\piston:'/piston' --name piston_api ghcr.io/engineer-man/piston


2. Clone the Piston repository

git clone https://github.com/engineer-man/piston


3. Add Python support

Run the following command:

node cli/index.js ppman install python

By default, this command uses the container API running at localhost:2000 to install Python.
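To confirm the runtime is actually available, you can query Piston's public REST API (the /api/v2/runtimes endpoint below comes from the project's documentation; adjust the host and port if you mapped the container differently):

// List all runtimes known to the container and check that Python is among them.
const response = await fetch("http://localhost:2000/api/v2/runtimes");
const runtimes: { language: string; version: string }[] = await response.json();
console.log(runtimes.filter((runtime) => runtime.language === "python"));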


Sample code execution

Use the Piston Node.js client:

import piston from "piston-client";

// Point the client at the locally running Piston container.
const codeInterpreter = piston({ server: "http://localhost:2000" });

// Run a Python snippet inside the sandbox.
const result = await codeInterpreter.execute('python', 'print("Hello World!")');

console.log(result);



AI agent implementation

Source code on GitHub

We will use some advanced technologies:

  • Graph and subgraph architecture (LangGraph)
  • Parallel node execution
  • Qdrant for storage
  • Observability through LangSmith
  • GPT-4o-mini as a cost-effective LLM

For a detailed overview of the process, see the LangSmith trace:
https://smith.langchain.com/public/b3a64491-b4e1-423d-9802-06fcf79339d2/r


Step 1: Extract date-time related scheduling parameters from user input

Examples: “tomorrow”, “last Friday”, “2 hours from now”, “noon”.
We use a code interpreter to ensure reliable extraction, since an LLM may fail even when the current datetime is provided in its context.

Example of the Python code generation prompt:

Your task is to transform natural language text into Python code that extracts datetime-related scheduling parameters from user input.  

## Instructions:  
- You are allowed to use only the "datetime" and "calendar" libraries.  
- You can define additional private helper methods to improve code readability and modularize validation logic.  
- Do not include any import statements in the output.  
- Assume all input timestamps are provided in the GMT+8 timezone. Adjust calculations accordingly.  
- The output should be a single method definition with the following characteristics:  
  - Method name: \`getCustomerSchedulingParameters\`  
  - Arguments: None  
  - Return: A JSON object with the keys:  
    - \`appointment_date\`: The day of the month (integer or \`None\`).  
    - \`appointment_month\`: The month of the year (integer or \`None\`).  
    - \`appointment_year\`: The year (integer or \`None\`).  
    - \`appointment_time_hour\`: The hour of the day in 24-hour format (integer or \`None\`).  
    - \`appointment_time_minute\`: The minute of the hour (integer or \`None\`).  
    - \`duration_hours\`: The duration of the appointment in hours (float or \`None\`).  
    - \`frequency\`: The recurrence of the appointment. Can be \`"Adhoc"\`, \`"Daily"\`, \`"Weekly"\`, or \`"Monthly"\` (string or \`None\`).  

- If a specific value is not found in the text, return \`None\` for that field.  
- Focus only on extracting values explicitly mentioned in the input text; do not make assumptions.  
- Do not include print statements or logging in the output.  

## Example:  

### Input:  
"I want to book an appointment for next Monday at 2pm for 2.5 hours."  

### Output:  
def getCustomerSchedulingParameters():  
    """Extracts and returns scheduling parameters from user input in GMT+8 timezone.  

    Returns:  
        A JSON object with the required scheduling parameters.  
    """  
    def _get_next_monday():  
        """Helper function to calculate the date of the next Monday."""  
        current_time = datetime.datetime.utcnow() + datetime.timedelta(hours=8)  # Adjust to GMT+8  
        today = current_time.date()  
        days_until_monday = (7 - today.weekday() + 0) % 7  # Monday is 0  
        return today + datetime.timedelta(days=days_until_monday)  

    next_monday = _get_next_monday()  
    return {  
        "appointment_date": next_monday.day,  
        "appointment_month": next_monday.month,  
        "appointment_year": next_monday.year,  
        "appointment_time_hour": 14,  
        "appointment_time_minute": 0,  
        "duration_hours": 2.5,  
        "frequency": "Adhoc"  
    }

### Notes:
Ensure the output is plain Python code without any formatting or additional explanations.
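For completeness, this is roughly how the prompt above can be sent to GPT-4o-mini with LangChain's ChatOpenAI client; the helper name and prompt wiring are illustrative assumptions, not the exact node from the repository:

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const extractionPrompt = "…"; // the full code generation prompt shown above

// Hypothetical helper: returns the generated getCustomerSchedulingParameters definition as plain text.
async function generateExtractionMethod(userMessage: string): Promise<string> {
  const response = await llm.invoke([
    ["system", extractionPrompt],
    ["human", userMessage],
  ]);
  return response.content as string;
}

const pythonParametersExtractionMethod = await generateExtractionMethod(
  "I want to book an appointment for next Monday at 2pm for 2.5 hours."
);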


Step 2: Get rules from storage

The retrieved rules are then converted into Python code for validation, as sketched below.
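A rough sketch of this step, assuming a Qdrant collection called scheduling_rules with a rule_text payload field (both names are made up here; the actual repository may store and retrieve rules differently):

import { QdrantClient } from "@qdrant/js-client-rest";
import { ChatOpenAI } from "@langchain/openai";

const qdrant = new QdrantClient({ url: "http://localhost:6333" });
const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

async function generateValidationMethod(): Promise<string> {
  // Pull the current scheduling rules from storage.
  const { points } = await qdrant.scroll("scheduling_rules", { limit: 50, with_payload: true });
  const rules = points.map((point) => point.payload?.rule_text).filter(Boolean).join("\n- ");

  // Ask the LLM to turn the rules into the Python method the Step 3 wrapper expects.
  const response = await llm.invoke([
    ["system", "Transform the scheduling rules below into a single Python method named " +
      "validateCustomerSchedulingParameters(appointment_year, appointment_month, appointment_date, " +
      "appointment_time_hour, appointment_time_minute, duration_hours, frequency) that returns a list " +
      "of human-readable validation error strings. Use only the datetime and calendar libraries."],
    ["human", `- ${rules}`],
  ]);
  return response.content as string;
}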


Step 3: Execute the generated code in the sandbox:

const pythonCodeToInvoke = `
import sys
import datetime
import calendar
import json

${state.pythonValidationMethod}

${state.pythonParametersExtractionMethod}

parameters = getCustomerSchedulingParameters()

validation_errors = validateCustomerSchedulingParameters(parameters["appointment_year"], parameters["appointment_month"], parameters["appointment_date"], parameters["appointment_time_hour"], parameters["appointment_time_minute"], parameters["duration_hours"], parameters["frequency"])

print(json.dumps({"validation_errors": validation_errors}))`;

    // Wrap the sandbox call with LangSmith's traceable() so the execution appears in the trace.
    const traceableCodeInterpreterFunction = traceable((pythonCodeToInvoke: string) => codeInterpreter.execute('python', pythonCodeToInvoke, { args: [] }));
    const result = await traceableCodeInterpreterFunction(pythonCodeToInvoke);
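Continuing from the snippet above, the validation errors can then be read back from the sandbox output. The result shape (run.stdout) follows what the Piston API returns, but treat it as an assumption and add error handling for failed executions:

// Parse the JSON line printed by the Python wrapper.
const stdout: string = result.run?.stdout ?? "";
const { validation_errors } = JSON.parse(stdout);

if (validation_errors.length > 0) {
  // Validation failed: report the issues back to the customer instead of booking.
  console.log("Validation failed:", validation_errors);
}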

Source code on GitHub



Potential improvements

  • Implement an iterative loop so the LLM can dynamically debug and fix failing Python code executions (see the sketch after this list).
  • Human-in-the-loop review of the generated validation method code.
  • Cache the generated code.
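A minimal sketch of the first idea, reusing the codeInterpreter client from earlier and a hypothetical regeneratePythonCode(code, error) helper that asks the LLM for a corrected version (both the helper and the retry policy are illustrative):

async function executeWithRetries(pythonCode: string, maxAttempts = 3): Promise<string> {
  let code = pythonCode;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await codeInterpreter.execute("python", code, { args: [] });
    const stderr: string = result.run?.stderr ?? "";
    if (!stderr) return result.run?.stdout ?? "";
    // Feed the runtime error back to the LLM and ask it to fix its own code.
    code = await regeneratePythonCode(code, stderr);
  }
  throw new Error("Generated code still failing after all retry attempts");
}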


Final thoughts

Sandboxed code execution and token-based LLMs are highly complementary technologies that unlock new flexibility. There is a bright future for this collaborative approach, exemplified by AWS's recent “Bedrock Automated Reasoning,” which appears to offer a similar capability across its enterprise ecosystem. Google and Microsoft will likely show us something similar soon.
