Server Mode
AI Cortex can run as a local OpenAI-compatible HTTP proxy, so any OpenAI client (the Python SDK, curl, LangChain, or another third-party tool) can use free community Ollama models with no code changes beyond pointing the client at the proxy.
Installation
Server mode requires FastAPI and Uvicorn. Install them with the server extra:
# Quote the extra so shells such as zsh do not expand the brackets
pip install "aicortex-core[server]"
Starting the Server
Python
from aicortex.tools import run_server
# Quickstart: defaults to 127.0.0.1:8000
run_server()
# Custom configuration
run_server(
    host="0.0.0.0",               # Expose on all interfaces
    port=8080,                    # Custom port
    default_model="llama3.2:3b",  # Override default model
    reload=False,                 # Disable auto-reload in production
)
Command line
python -m aicortex.tools.server
Environment variables
You can configure the server without touching Python:
| Variable | Default | Description |
|---|---|---|
| DEFAULT_MODEL | gpt-oss:20b | Default model when none is specified |
| | 127.0.0.1 | Server bind address |
| PORT | 8000 | Server port |
DEFAULT_MODEL=mistral:7b PORT=8080 python -m aicortex.tools.server
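As a rough illustration of how environment-driven configuration like this is usually resolved, the sketch below reads the two variables named above and falls back to defaults. It is not AI Cortex's actual implementation: `load_settings` is a hypothetical helper, and the fallback values are assumptions taken from the example responses on this page.

```python
import os


def load_settings(env=None):
    """Resolve server settings from environment variables with fallbacks.

    Hypothetical helper for illustration only; not part of aicortex.
    The fallback values mirror the examples shown on this page.
    """
    env = os.environ if env is None else env
    return {
        "default_model": env.get("DEFAULT_MODEL", "gpt-oss:20b"),
        # Environment values are always strings, so the port is converted.
        "port": int(env.get("PORT", "8000")),
    }


# Same overrides as the shell command above, passed as a plain dict.
print(load_settings({"DEFAULT_MODEL": "mistral:7b", "PORT": "8080"}))
```

Passing the environment as a parameter keeps the helper easy to exercise without mutating `os.environ`.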
API Endpoints
GET / - Server info
Returns service metadata and a map of available endpoints.
{
  "service": "AI Cortex OpenAI-Compatible Proxy",
  "version": "1.0.3",
  "default_model": "gpt-oss:20b",
  "endpoints": {
    "models": "/models",
    "openai_models": "/v1/models",
    "chat_completions": "/v1/chat/completions",
    "health": "/health"
  }
}
GET /health - Health check
Lightweight liveness probe: returns immediately and performs no model I/O.
{
  "status": "ok",
  "timestamp": 1746187200.123
}
GET /config - Runtime configuration
Returns the configuration of the running server.
{
  "default_model": "gpt-oss:20b",
  "host": "127.0.0.1",
  "port": 8000,
  "available_models": ["llama3.2:3b", "mistral:7b"],
  "model_count": 2
}
GET /models - AI Cortex model list
Returns available models in AI Cortex's native format.
{
  "models": ["llama3.2:3b", "mistral:7b"],
  "default_model": "gpt-oss:20b",
  "total_models": 2
}
GET /v1/models - OpenAI-compatible model list
Matches the OpenAI /v1/models response schema exactly.
{
  "object": "list",
  "data": [
    {
      "id": "llama3.2:3b",
      "object": "model",
      "created": 1640995200,
      "owned_by": "aicortex"
    },
    {
      "id": "mistral:7b",
      "object": "model",
      "created": 1640995200,
      "owned_by": "aicortex"
    }
  ]
}
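To show how a client might consume this response, here is a minimal sketch that extracts the model IDs from the JSON body above. `model_ids` is a hypothetical helper name, and the sample string is an abbreviated copy of the example response rather than live server output.

```python
import json


def model_ids(payload: str) -> list[str]:
    """Return the model IDs from an OpenAI-style /v1/models JSON body."""
    body = json.loads(payload)
    if body.get("object") != "list":
        raise ValueError("unexpected response shape: 'object' is not 'list'")
    return [entry["id"] for entry in body.get("data", [])]


# Abbreviated copy of the example response shown above.
sample = json.dumps({
    "object": "list",
    "data": [
        {"id": "llama3.2:3b", "object": "model", "created": 1640995200, "owned_by": "aicortex"},
        {"id": "mistral:7b", "object": "model", "created": 1640995200, "owned_by": "aicortex"},
    ],
})

print(model_ids(sample))
```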
POST /v1/chat/completions - Chat completions
The primary endpoint. Accepts the full OpenAI chat completions request body.
Request:
{
  "model": "llama3.2:3b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum entanglement simply."}
  ],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 200
}
Response (non-streaming):
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1746187200,
  "model": "llama3.2:3b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum entanglement is when two particles become linked..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 61,
    "total_tokens": 85
  }
}
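For environments where no OpenAI client library is installed, the request and response above can be handled with the Python standard library alone. The following is a sketch under assumptions: the server is running on the default localhost:8000, and `build_chat_request`, `extract_reply`, and `chat` are hypothetical helper names, not part of AI Cortex.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server started with defaults


def build_chat_request(prompt, model="llama3.2:3b", base_url=BASE_URL):
    """Build (but do not send) a non-streaming chat completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat.completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send the request and return the reply text (needs a running server)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_reply(response.read())
```

Calling `chat("Hello!")` against a running server would return the assistant's text. Building the request and parsing the reply are kept separate so each can be checked without network access.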
Response (streaming, "stream": true):
The server returns Server-Sent Events in OpenAI's streaming format. Each event carries a partial delta:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Quantum"},"index":0}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":" entanglement"},"index":0}]}
data: [DONE]
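The event format above can be decoded by hand when an SDK is not available. This sketch assumes each event arrives as a single `data:` line, as in the example; `iter_deltas` is a hypothetical helper name.

```python
import json


def iter_deltas(lines):
    """Yield the text fragments carried by OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# The example events shown above, with the blank separator SSE uses.
sample = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Quantum"},"index":0}]}',
    "",
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":" entanglement"},"index":0}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_deltas(sample)))  # prints: Quantum entanglement
```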
Client Examples
curl - non-streaming
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }' | python3 -m json.tool
curl - streaming
curl -N -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Count to five."}],
    "stream": true
  }'
OpenAI Python SDK - drop-in replacement
Change only base_url and api_key. Everything else is identical to normal OpenAI usage:
from openai import OpenAI
client = OpenAI(
    api_key="none",                       # Required by the SDK, not validated
    base_url="http://localhost:8000/v1",  # Point to AI Cortex
)
response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "What is the Fibonacci sequence?"}],
    temperature=0.5,
    max_tokens=150,
)
print(response.choices[0].message.content)
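OpenAI Python SDK - streaming
from openai import OpenAI
client = OpenAI(api_key="none", base_url="http://localhost:8000/v1")
# stream=True makes create() return an iterator of chat.completion.chunk objects
stream = client.chat.completions.create(
    model="mistral:7b",
    messages=[{"role": "user", "content": "Write a short poem about the ocean."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)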
LangChain
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="llama3.2:3b",
    openai_api_key="none",
    openai_api_base="http://localhost:8000/v1",
    temperature=0.7,
)
response = llm.invoke("Explain the CAP theorem.")
print(response.content)
Error Handling
The server returns standard HTTP status codes with descriptive JSON bodies.
| Status | Meaning |
|---|---|
| 400 | Bad request: missing or invalid parameters |
| 500 | Internal error: model unavailable or server-side failure |
Example error response:
{
"detail": "Model 'gpt-4' not found. Available models: ['llama3.2:3b', 'mistral:7b']"
}
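On the client side, the `detail` field can be turned into a readable message. The sketch below is one possible approach; `describe_error` is a hypothetical helper, and it falls back to the raw body in case an intermediary (for example a reverse proxy) returns something that is not JSON.

```python
import json


def describe_error(status: int, body: str) -> str:
    """Summarise an error response as '<kind> <status>: <detail>'."""
    try:
        detail = json.loads(body).get("detail", body)
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object: show the raw body instead.
        detail = body
    kind = "client error" if 400 <= status < 500 else "server error"
    return f"{kind} {status}: {detail}"


# Illustrative input only; the status code paired with this message is assumed.
print(describe_error(400, '{"detail": "messages is required"}'))
```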
Production Deployment
Expose on the network
run_server(host="0.0.0.0", port=8000, reload=False)
systemd service
[Unit]
Description=AI Cortex OpenAI-Compatible Proxy
After=network.target
[Service]
User=www-data
WorkingDirectory=/opt/aicortex
ExecStart=/opt/aicortex/.venv/bin/python -m aicortex.tools.server
Restart=on-failure
RestartSec=5
Environment=DEFAULT_MODEL=llama3.2:3b
Environment=PORT=8000
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable aicortex
sudo systemctl start aicortex
nginx reverse proxy
server {
    listen 80;
    server_name ai.example.com;
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;  # Required for SSE streaming
        proxy_read_timeout 300s;
    }
}
Important: Set proxy_buffering off. Without it, SSE streaming responses are buffered and appear to hang.
Health check integration
The /health endpoint is designed for load balancers and uptime monitors:
# Prints UP when /health answers with a 2xx status, DOWN otherwise (curl -f fails on HTTP errors)
curl -sf http://localhost:8000/health > /dev/null && echo "UP" || echo "DOWN"
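The same check can be scripted in Python, for example to wait for the server to come up before running a test suite. This is a sketch assuming the default address and the `{"status": "ok"}` body shown earlier; `probe` and `wait_until_healthy` are hypothetical names.

```python
import json
import time
import urllib.error
import urllib.request


def probe(url="http://localhost:8000/health", timeout=2.0):
    """Return True if the health endpoint answers 200 with status 'ok'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = json.loads(response.read())
            return response.status == 200 and body.get("status") == "ok"
    except (urllib.error.URLError, OSError, ValueError, AttributeError):
        # Connection refused, timeout, HTTP error, or an unexpected body.
        return False


def wait_until_healthy(check=probe, attempts=10, delay=1.0, sleep=time.sleep):
    """Call check() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

`wait_until_healthy()` returns `True` as soon as one probe succeeds. The `check` and `sleep` parameters exist so the retry loop can be exercised without a live server.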