🤖 Model Management

AI Cortex ships with a built-in database of hundreds of community-hosted Ollama models, organized by family. This page covers how to browse and query that catalog, how to read rich model metadata, and how to keep the database fresh with the four-step discovery pipeline.

📦 Model Families

AI Cortex groups models into five families: llama, mistral, gemma, deepseek, and qwen. Each family maps to a JSON file bundled inside the package.

from aicortex import families

print(families())
# ['llama', 'mistral', 'gemma', 'deepseek', 'qwen']

🔍 Listing Models

All models across every family

from aicortex import models

all_models = models()
print(f"Total available: {len(all_models)} models")
print(all_models[:5])
# ['llama3.2:3b', 'llama3.1:8b', 'mistral:7b', ...]

Models in a specific family

llama_models   = models("llama")
mistral_models = models("mistral")

print("Llama:",   llama_models)
print("Mistral:", mistral_models)

Checking model availability

from aicortex import models

available = models()

if "llama3.2:3b" in available:
    print("✅ Model is available")
else:
    print("❌ Not found. Try one of:", available[:5])

📋 Model Metadata

Each model entry carries three categories of metadata: identity fields, server location fields, and performance benchmark fields recorded when the model database was last refreshed.

Getting full model info

from aicortex import get_model_info

info = get_model_info("llama3.2:3b")

for key, value in info.items():
    print(f"  {key}: {value}")

Identity fields

| Field | Description |
| --- | --- |
| id | Unique UUID for this model record |
| model_name | Human-readable model name (e.g. llama3.2:3b) |
| model | Ollama model tag (same as model_name) |
| family | Model family (llama, mistral, etc.) |
| format | Weight format (typically gguf) |
| parameter_size | Parameter count (e.g. 3.2B, 7B, 70B) |
| quantization_level | Quantization used (e.g. Q4_K_M, Q8_0, F16) |
| size | Raw model size in bytes |
| digest | SHA256 digest of the model weights |
| parent_model | Base model this was fine-tuned from (if any) |
| modified_at | Timestamp the model was last modified on its server |
| date_added | Timestamp when this record was first added to the database |
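The sample record later on this page stores size and parameter_size as strings, so a little conversion helps before comparing models. The sketch below is illustrative only: parse_parameter_size and human_size are hypothetical helpers, not part of aicortex, and the "M" suffix is an assumption about how sub-billion models might be labeled.

```python
def parse_parameter_size(text):
    """Convert a string such as '3.2B' or '560M' into billions of parameters."""
    scale = {"M": 1e-3, "B": 1.0}
    number, suffix = text[:-1], text[-1].upper()
    return float(number) * scale[suffix]


def human_size(num_bytes):
    """Render a byte count (int or numeric string) as GiB with two decimals."""
    return f"{int(num_bytes) / 1024**3:.2f} GiB"


# Values copied from the sample record in the JSON Database Structure section
record = {"model_name": "llama3.2:3b", "parameter_size": "3.2B", "size": "2019393189"}

params_b = parse_parameter_size(record["parameter_size"])
size_text = human_size(record["size"])
print(f"{record['model_name']}: {params_b}B parameters, {size_text}")
```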

Server location fields

| Field | Description |
| --- | --- |
| ip_port | Full URL of the hosting Ollama server |
| ip_city_name_en | City of the server |
| ip_country_name_en | Country of the server |
| ip_country_iso_code | ISO country code |
| ip_continent_code | Continent code (e.g. EU, NA) |
| ip_continent_name_en | Full continent name |
| ip_isp | Internet service provider |
| ip_organization | Hosting organization |
| ip_connection_type | Connection type (e.g. Corporate, Residential) |
| ip_autonomous_system_number | ASN of the hosting network |
| ip_autonomous_system_organization | ASN owner name |
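Location fields make it possible to prefer servers close to you. The following sketch filters plain dictionaries shaped like the records above. The second and third entries are made-up placeholders using documentation-reserved IP ranges, and in_continent is a hypothetical helper rather than an aicortex function.

```python
from collections import Counter

# Only the first ip_port comes from this page; the others are placeholders
servers = [
    {"ip_port": "http://5.149.249.212:11434", "ip_country_iso_code": "NL", "ip_continent_code": "EU"},
    {"ip_port": "http://192.0.2.10:11434", "ip_country_iso_code": "DE", "ip_continent_code": "EU"},
    {"ip_port": "http://198.51.100.7:11434", "ip_country_iso_code": "US", "ip_continent_code": "NA"},
]


def in_continent(records, code):
    """Keep only the records whose continent code matches."""
    return [r for r in records if r.get("ip_continent_code") == code]


eu_servers = in_continent(servers, "EU")
per_country = Counter(r["ip_country_iso_code"] for r in servers)

print([s["ip_port"] for s in eu_servers])
print(dict(per_country))
```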

Performance benchmark fields

These are recorded during the discovery pipeline's live test of each endpoint.

| Field | Description |
| --- | --- |
| perf_status | "success" or "error" |
| perf_tokens | Number of tokens generated in the benchmark run |
| perf_time_seconds | Total generation time in seconds |
| perf_tokens_per_second | Average throughput |
| perf_max_token_speed | Peak token generation speed |
| perf_avg_token_speed | Average token generation speed |
| perf_first_token_time | Time-to-first-token in seconds |
| perf_model_size_bytes | Model size reported by the server |
| perf_last_tested | Timestamp of the last benchmark run |
| perf_error | Error message if perf_status is "error" |
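A common use of these fields is picking the fastest endpoint that passed its last benchmark. The sketch below assumes numeric values arrive as strings, as in the sample JSON record on this page. The records other than the first are invented for illustration, and fastest_server is a hypothetical helper.

```python
records = [
    {"ip_port": "http://5.149.249.212:11434", "perf_status": "success",
     "perf_tokens_per_second": "13.01", "perf_first_token_time": "3.597"},
    {"ip_port": "http://192.0.2.10:11434", "perf_status": "success",
     "perf_tokens_per_second": "41.50", "perf_first_token_time": "0.812"},
    {"ip_port": "http://198.51.100.7:11434", "perf_status": "error",
     "perf_error": "timeout"},
]


def fastest_server(candidates):
    """Return the successful record with the highest throughput, or None."""
    ok = [r for r in candidates if r.get("perf_status") == "success"]
    # Throughput is stored as a string, so convert before comparing
    return max(ok, key=lambda r: float(r["perf_tokens_per_second"]), default=None)


best = fastest_server(records)
print(best["ip_port"], best["perf_tokens_per_second"], "tok/s")
```

Records with perf_status "error" are dropped first because they may not carry throughput values at all.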

🌐 Server Discovery

Models are hosted on community Ollama servers. AI Cortex automatically selects a working server when you call chat(), but you can also inspect server availability directly.

List all servers hosting a model

from aicortex import list_model_servers

servers = list_model_servers("llama3.2:3b")

for s in servers:
    print(f"  {s['ip_port']}  -  {s['ip_city_name_en']}, {s['ip_country_name_en']}")

Get server info for a specific model

from aicortex import get_server_info

# Pick any working server for the given model
info = get_server_info("llama3.2:3b")
print(f"Server:   {info['ip_port']}")
print(f"Location: {info['ip_city_name_en']}, {info['ip_country_name_en']}")
print(f"Speed:    {info['perf_tokens_per_second']} tok/s")

# Or target a specific server URL
info = get_server_info("llama3.2:3b", "http://5.149.249.212:11434")

LangChain-compatible params

get_llm_params() and get_random_llm_params() return a dict you can unpack directly into LangChain's OllamaLLM:

from aicortex import get_llm_params, get_random_llm_params
from langchain_ollama import OllamaLLM  # OllamaLLM ships in the langchain-ollama package

# Pick a server for a specific model
params = get_llm_params("mistral:7b")
# → {'model': 'mistral:7b', 'base_url': 'http://...'}

llm = OllamaLLM(**params)

# Pick any random model from any available server
params = get_random_llm_params()
llm = OllamaLLM(**params)

⚙️ Model Selection in chat()

Default model

The default model is "gpt-oss:20b". You can override it per call:

from aicortex import chat

# Uses default model
response = chat("What is 2 + 2?")

# Specify explicitly
response = chat("Write a sorting algorithm.", model="llama3.2:3b")
response = chat("Summarize this text.", model="mistral:7b")

Choosing by performance

Use get_model_info() to compare benchmarks before picking a model:

from aicortex import models, get_model_info

for name in models("llama"):
    try:
        info  = get_model_info(name)
        speed = info.get("perf_tokens_per_second", "?")
        size  = info.get("parameter_size", "?")
        print(f"  {name:<25} {size:<8} {speed} tok/s")
    except Exception:
        continue

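Forcing a specific server

import os

from aicortex import chat

# Set OLLAMA_HOST before calling chat() so the request targets this server
os.environ["OLLAMA_HOST"] = "http://5.149.249.212:11434"

response = chat("Hello!", model="llama3.2:3b")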
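Setting OLLAMA_HOST changes the environment for the rest of the process. If you only want to pin a server for one call, a small context manager can restore the previous value afterwards. This is a sketch that assumes AI Cortex reads OLLAMA_HOST at call time. ollama_host is a hypothetical helper, not part of the package.

```python
import os
from contextlib import contextmanager


@contextmanager
def ollama_host(url):
    """Temporarily set OLLAMA_HOST, then restore whatever was there before."""
    previous = os.environ.get("OLLAMA_HOST")
    os.environ["OLLAMA_HOST"] = url
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop("OLLAMA_HOST", None)
        else:
            os.environ["OLLAMA_HOST"] = previous


before = os.environ.get("OLLAMA_HOST")

with ollama_host("http://5.149.249.212:11434"):
    # A call such as chat("Hello!", model="llama3.2:3b") would go here
    inside = os.environ["OLLAMA_HOST"]

after = os.environ.get("OLLAMA_HOST")
print(inside, after == before)
```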

📝 Advanced Request Building

build_api_request() constructs the raw Ollama API payload, which is useful when you need full control or are integrating with a custom HTTP layer:

from aicortex import build_api_request

payload = build_api_request(
    model="llama3.2:3b",
    prompt="Explain recursion.",
    temperature=0.3,
    max_tokens=300,
    top_p=0.9,
    stop=["\n\n", "END"],
)

print(payload)
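For orientation, Ollama's /api/generate endpoint expects sampling settings under an options key, with the token limit named num_predict. The sketch below hand-builds a payload in that shape from the same arguments. It is not the output of build_api_request(), whose exact structure this page does not show, and sketch_generate_payload is a hypothetical name.

```python
import json


def sketch_generate_payload(model, prompt, temperature=None, max_tokens=None,
                            top_p=None, stop=None):
    """Build a dict in the shape accepted by Ollama's /api/generate endpoint."""
    options = {
        "temperature": temperature,
        "num_predict": max_tokens,  # Ollama's name for the generation limit
        "top_p": top_p,
        "stop": stop,
    }
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Leave out anything the caller did not set
        "options": {k: v for k, v in options.items() if v is not None},
    }


payload = sketch_generate_payload(
    model="llama3.2:3b",
    prompt="Explain recursion.",
    temperature=0.3,
    max_tokens=300,
    top_p=0.9,
    stop=["\n\n", "END"],
)
print(json.dumps(payload, indent=2))
```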

🗂️ JSON Database Structure

The model database lives in aicortex/models/, one file per family:

aicortex/models/
├── llama.json
├── mistral.json
├── gemma.json
├── deepseek.json
└── qwen.json

Each file uses a nested envelope matching the Ollama registry format:

{
  "props": {
    "pageProps": {
      "models": [
        {
          "id": "36a29c78-bb0a-49ef-a21a-e6a15b5b1dd1",
          "ip_port": "http://5.149.249.212:11434",
          "model_name": "llama3.2:3b",
          "model": "llama3.2:3b",
          "family": "llama",
          "format": "gguf",
          "parameter_size": "3.2B",
          "quantization_level": "Q4_K_M",
          "size": "2019393189",
          "digest": "a80c4f17...",
          "modified_at": "2026-03-22T00:41:35Z",
          "date_added": "2026-03-22T00:41:35Z",
          "ip_city_name_en": "Amsterdam",
          "ip_country_name_en": "The Netherlands",
          "ip_country_iso_code": "NL",
          "perf_status": "success",
          "perf_tokens_per_second": "13.01",
          "perf_first_token_time": "3.597",
          "perf_last_tested": "2025-04-19T08:24:12Z"
        }
      ]
    }
  }
}
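If you read a family file yourself, the records sit three levels down and most numeric values are strings. The sketch below walks the envelope with the standard json module using an inline, shortened copy of the sample. extract_models and with_numbers are hypothetical helpers.

```python
import json

raw = """
{
  "props": {
    "pageProps": {
      "models": [
        {"model_name": "llama3.2:3b", "family": "llama",
         "size": "2019393189", "perf_tokens_per_second": "13.01"}
      ]
    }
  }
}
"""


def extract_models(document):
    """Return the list under props.pageProps.models, or [] if a level is missing."""
    return document.get("props", {}).get("pageProps", {}).get("models", [])


def with_numbers(record):
    """Copy a record, converting the string-encoded numeric fields."""
    converted = dict(record)
    converted["size"] = int(record["size"])
    converted["perf_tokens_per_second"] = float(record["perf_tokens_per_second"])
    return converted


entries = [with_numbers(r) for r in extract_models(json.loads(raw))]
print(entries[0]["model_name"], entries[0]["size"], entries[0]["perf_tokens_per_second"])
```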

🔄 Refreshing the Model Database

The bundled database is a static snapshot. Use the four-step pipeline to pull fresh data from live community servers.

Full pipeline at a glance

from pathlib import Path
from aicortex.tools import (
    find_valid_endpoints,   # Step 1: ping known IPs
    fetch_models,           # Step 2: pull model lists
    resolve_models,         # Step 3: merge with IP metadata
    apply_valid_models,     # Step 4: write family JSONs
)

json_dir = Path("aicortex/models")

# Step 1: check which ip_port entries in the JSON files are actually alive
valid_urls = find_valid_endpoints(json_dir)
print(f"Live endpoints: {len(valid_urls)}")

url_file = Path("valid.txt")
url_file.write_text("\n".join(valid_urls))

# Step 2: fetch the model list from each live endpoint
fetch_models(url_file, Path("fetched.json"))

# Step 3: merge fetched data with existing IP/perf metadata
resolve_models(Path("fetched.json"), json_dir, Path("resolved.json"))

# Step 4: group by family and write updated JSON files (with backup)
apply_valid_models(Path("resolved.json"), json_dir, backup=True)
apply_valid_models(Path("resolved.json"), json_dir, backup=True)
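To make Step 4 more concrete, here is a rough sketch of what "group by family and write with backup" could look like. It is not the implementation of apply_valid_models(). The function name, the .bak suffix, and the choice to wrap records in the props.pageProps.models envelope are all assumptions made for illustration. It runs inside a temporary directory so nothing real is overwritten.

```python
import json
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path


def write_family_files(records, json_dir, backup=True):
    """Group records by family and write one enveloped JSON file per family."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["family"]].append(record)

    written = []
    for family, items in grouped.items():
        target = json_dir / f"{family}.json"
        if backup and target.exists():
            # Keep the previous file next to the new one (assumed naming)
            shutil.copy2(target, target.with_name(target.name + ".bak"))
        envelope = {"props": {"pageProps": {"models": items}}}
        target.write_text(json.dumps(envelope, indent=2))
        written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    json_dir = Path(tmp)
    (json_dir / "llama.json").write_text("{}")  # pretend an old file exists

    records = [
        {"model_name": "llama3.2:3b", "family": "llama"},
        {"model_name": "mistral:7b", "family": "mistral"},
    ]
    write_family_files(records, json_dir)

    names = sorted(p.name for p in json_dir.iterdir())
    reloaded = json.loads((json_dir / "llama.json").read_text())
    llama_count = len(reloaded["props"]["pageProps"]["models"])

print(names, llama_count)
```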

For full documentation on each step, see 🔧 Tools.

⚑ Performance & Selection Guide#

Parameter count vs. speed

| Size class | Range | Best for |
| --- | --- | --- |
| Small | 1B–3B | Fast Q&A, simple code, edge hardware |
| Medium | 7B–13B | Best quality-to-speed balance |
| Large | 30B–70B | Complex reasoning; needs capable server |
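The ranges above can be applied to the parameter_size field with a small classifier. The table leaves gaps (for example 3.2B, or anything between 13B and 30B), so the cut-offs below are an assumption: values under 7B count as Small and values under 30B as Medium. size_class is a hypothetical helper and only handles the "B" suffix.

```python
def size_class(parameter_size):
    """Map a string such as '3.2B' onto Small / Medium / Large."""
    billions = float(parameter_size.rstrip("Bb"))
    if billions < 7:       # assumed cut-off: covers the 1B-3B row and the gap up to 7B
        return "Small"
    if billions < 30:      # assumed cut-off: covers 7B-13B and the gap up to 30B
        return "Medium"
    return "Large"


for label in ("3.2B", "7B", "13B", "70B"):
    print(label, size_class(label))
```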

Quantization guide

| Level | Quality | Size | Notes |
| --- | --- | --- | --- |
| Q4_K_M | Good | Small | Default on most community servers |
| Q5_K_M | Better | Moderate | Recommended if server allows |
| Q8_0 | High | Large | Near-lossless compression |
| F16 | Unquantized (16-bit float) | Very large | Maximum accuracy |
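A back-of-the-envelope way to compare these levels is parameters × bits per weight ÷ 8. The sketch below uses nominal bit widths (4, 5, 8, 16), which is a simplifying assumption: real GGUF files are somewhat larger because quantized formats also store scaling data and keep some tensors at higher precision. For instance, the sample llama3.2:3b Q4_K_M record on this page reports 2019393189 bytes, above the nominal estimate.

```python
# Nominal bits per weight; actual files carry extra overhead (assumption noted above)
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8, "F16": 16}


def estimate_bytes(params_billion, level):
    """Lower-bound size estimate: parameters * bits per weight / 8."""
    return params_billion * 1e9 * NOMINAL_BITS[level] / 8


for level in NOMINAL_BITS:
    gib = estimate_bytes(3.2, level) / 1024**3
    print(f"{level:<7} ~{gib:.2f} GiB for a 3.2B model")
```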

🛠️ Troubleshooting

Model not found

from aicortex import models

available = models()
query = "llama3"
suggestions = [m for m in available if query in m]
print(f"Did you mean: {suggestions}")
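Substring matching misses typos such as a dropped letter. The standard library's difflib can rank near matches instead. In this sketch the available list is a stand-in for the result of models(), reusing names that appear on this page.

```python
from difflib import get_close_matches

# Stand-in for models(); in practice pass the real catalog
available = ["llama3.2:3b", "llama3.1:8b", "mistral:7b"]

query = "lama3.2:3b"  # typo: missing the first "l"
suggestions = get_close_matches(query, available, n=3, cutoff=0.6)

if suggestions:
    print(f"Did you mean: {suggestions[0]}?")
else:
    print("No close match found.")
```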

Server connection failure

from aicortex import get_server_info

try:
    info = get_server_info("llama3.2:3b")
    print(f"✅ Server OK: {info['ip_port']}")
except Exception as e:
    print(f"❌ No working server found: {e}")
    print("Try a different model or run the refresh pipeline.")
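"Try a different model" can be automated by walking a list of fallbacks. The sketch below takes the probing function as an argument so it runs without network access. In real use you might pass get_server_info as the probe. first_working and fake_probe are hypothetical names used only for this illustration.

```python
def first_working(candidates, probe):
    """Try each model name in order; return the first one whose probe succeeds."""
    failures = {}
    for name in candidates:
        try:
            return name, probe(name), failures
        except Exception as exc:
            failures[name] = str(exc)
    return None, None, failures


def fake_probe(model_name):
    """Stand-in for a real server lookup: the first model always fails."""
    if model_name == "llama3.2:3b":
        raise ConnectionError("no working server")
    return {"ip_port": "http://192.0.2.10:11434"}  # placeholder address


chosen, info, failures = first_working(["llama3.2:3b", "mistral:7b"], fake_probe)
print("Using:", chosen, "| failed:", failures)
```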

Slow responses or stale data

The bundled JSON is a static snapshot. If you're hitting slow or unresponsive servers, run the refresh pipeline to rebuild the database from currently live endpoints.
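One way to judge whether a refresh is due is to look at perf_last_tested. The sketch below flags a record as stale when its last benchmark is older than a threshold. The 30-day default is an arbitrary assumption, is_stale is a hypothetical helper, and the reference date is fixed so the example is reproducible.

```python
from datetime import datetime, timedelta, timezone


def is_stale(record, now, max_age_days=30):
    """True if the record's last benchmark is older than max_age_days."""
    # Replace the trailing "Z" so fromisoformat also works on older Python versions
    tested = datetime.fromisoformat(record["perf_last_tested"].replace("Z", "+00:00"))
    return now - tested > timedelta(days=max_age_days)


record = {"perf_last_tested": "2025-04-19T08:24:12Z"}  # timestamp from the sample record

recent_check = is_stale(record, datetime(2025, 5, 1, tzinfo=timezone.utc))
later_check = is_stale(record, datetime(2025, 6, 1, tzinfo=timezone.utc))
print(recent_check, later_check)
```

In practice you would pass datetime.now(timezone.utc) as the reference time.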