# Cerebras API — highermark Integration

Cerebras Cloud provides very fast LLM inference for Llama models. Key advantage: ~60-120 tokens/second, which suits real-time Discord agent responses.

## API Credentials Setup

```
CEREBRAS_API_KEY=your_api_key_here
CEREBRAS_BASE_URL=https://api.cerebras.ai/v1
```

Get your key at: https://cloud.cerebras.ai
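
The two variables above can be read and checked at startup so that a missing key fails early instead of on the first request. This is a minimal sketch using only the standard library; the helper name `load_cerebras_settings` and the placeholder check are assumptions for illustration, not part of the Cerebras SDK.

```python
import os

# Default taken from the CEREBRAS_BASE_URL value shown above
DEFAULT_BASE_URL = "https://api.cerebras.ai/v1"


def load_cerebras_settings(env=None):
    """Return (api_key, base_url) from an environment-like mapping.

    Hypothetical helper: raises RuntimeError if the key is missing
    or still set to the placeholder from the example file.
    """
    env = os.environ if env is None else env
    api_key = env.get("CEREBRAS_API_KEY", "").strip()
    if not api_key or api_key == "your_api_key_here":
        raise RuntimeError("CEREBRAS_API_KEY is not set")
    # Strip a trailing slash so paths can be appended consistently
    base_url = env.get("CEREBRAS_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return api_key, base_url
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.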

## Available Models

| Model | Context | Best For |
|-------|---------|----------|
| `llama3.1-8b` | 128k | Fast agent replies, low cost |
| `llama3.1-70b` | 128k | High-quality, complex reasoning |
| `llama-3.3-70b` | 128k | Latest, balanced performance |

## Example API Call

```python
import os

from cerebras.cloud.sdk import Cerebras

# Read the key from the environment rather than hard-coding it
client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

response = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[
        {"role": "system", "content": "You are Curtana, a Discord AI agent."},
        {"role": "user", "content": "What's happening in the server?"}
    ],
    max_tokens=512,
    stream=False
)
print(response.choices[0].message.content)
```

## Streaming (for real-time Discord responses)

```python
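# Reuses `client` from the example above; `messages=[...]` is a placeholder
# for a real message list.
stream = client.chat.completions.create(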
    model="llama3.1-8b",
    messages=[...],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
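
Printing each chunk works in a terminal, but a Discord reply is usually updated by editing one message, and editing on every token would be very chatty. The sketch below batches streamed fragments into periodic snapshots of the reply. The function name `batch_stream` and the `flush_every` threshold are assumptions for illustration; the 2000-character cap is Discord's standard per-message limit, and this sketch simply truncates at that cap rather than splitting long replies.

```python
DISCORD_MESSAGE_LIMIT = 2000  # Discord's per-message character cap


def batch_stream(pieces, flush_every=200, limit=DISCORD_MESSAGE_LIMIT):
    """Yield the growing reply text each time enough new characters arrive.

    `pieces` is any iterable of text fragments, for example the non-empty
    `delta.content` values from a streaming loop. Hypothetical helper.
    """
    text = ""
    last_flush = 0
    for piece in pieces:
        text += piece
        if len(text) - last_flush >= flush_every:
            last_flush = len(text)
            yield text[:limit]  # snapshot suitable for a message edit
    if len(text) > last_flush:
        yield text[:limit]  # final flush with whatever is left
```

Each yielded string is the full reply so far, so the caller can pass it directly to whatever message-edit call its Discord library provides.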

## OpenRouter Fallback Integration

Cerebras is primary; OpenRouter is the fallback when Cerebras is at capacity:

```python
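import os

import openai
from cerebras.cloud.sdk import Cerebras

cerebras_client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

openrouter_client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

# The default is an assumed OpenRouter model ID for Llama 3.1 8B;
# check OpenRouter's model list before relying on it.
def call_llm(messages, fallback_model="meta-llama/llama-3.1-8b-instruct"):
    try:
        # Try Cerebras first
        return cerebras_client.chat.completions.create(
            model="llama3.1-8b", messages=messages
        )
    except Exception:
        # Any Cerebras failure (capacity, rate limit, network) falls back to OpenRouter
        return openrouter_client.chat.completions.create(
            model=fallback_model, messages=messages
        )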
```
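
Catching every `Exception` also sends programming errors (a malformed `messages` list, for instance) to the fallback. One possible refinement, sketched here with plain callables so it runs without either SDK, is to fall back only on a chosen set of error types and to report which provider answered. The name `call_with_fallback` and its `retryable` parameter are hypothetical; which SDK exception classes to pass in should be checked against each SDK's documentation.

```python
def call_with_fallback(primary, fallback, messages, retryable=(Exception,)):
    """Call `primary`; if it raises a retryable error, call `fallback`.

    Returns (result, provider_label) so callers can log which path answered.
    Errors not listed in `retryable` propagate unchanged.
    """
    try:
        return primary(messages), "primary"
    except retryable:
        return fallback(messages), "fallback"
```

With real clients, `primary` and `fallback` would be small functions wrapping each client's `chat.completions.create` call.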

## Pricing (approximate)

- `llama3.1-8b`: ~$0.10 / 1M tokens
- `llama3.1-70b`: ~$0.60 / 1M tokens

Average Discord agent interaction: ~500 tokens = ~$0.00005 per message at 8b.
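
The per-message figure is the token count divided by one million, times the per-million price. A small sketch of that arithmetic, using the approximate prices listed above (the function name is an assumption for illustration):

```python
def estimate_cost_usd(tokens, price_per_million_usd):
    """Approximate cost in USD for `tokens` at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_million_usd


# ~500 tokens at the approximate 8b and 70b prices from the list above
cost_8b = estimate_cost_usd(500, 0.10)
cost_70b = estimate_cost_usd(500, 0.60)
```

By the same formula, a ~500-token interaction on `llama3.1-70b` comes to roughly $0.0003.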

## highermark Device Integration

The ESP32-S3 does not run inference — it's the **runtime and router**. The device:
1. Receives Discord events
2. Packages the context
3. Calls Cerebras API via WiFi
4. Streams the response back to Discord
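
Step 2 above (packaging the context) can be sketched as turning recent Discord events into the `messages` list the chat API expects. The event shape used here (`author`, `content`, `is_bot` keys) and the `max_history` cap are assumptions for illustration; the actual device firmware may structure events differently.

```python
SYSTEM_PROMPT = "You are Curtana, a Discord AI agent."


def package_context(events, max_history=10):
    """Build a chat `messages` list from recent Discord events.

    `events` is a list of dicts with hypothetical keys:
    "author" (str), "content" (str), and optional "is_bot" (bool).
    """
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # Keep only the most recent events to bound request size
    recent = events[-max_history:] if max_history > 0 else []
    for event in recent:
        if event.get("is_bot"):
            # The agent's own earlier replies go back in as assistant turns
            messages.append({"role": "assistant", "content": event["content"]})
        else:
            # Prefix the author so the model can tell speakers apart
            messages.append(
                {"role": "user", "content": f'{event["author"]}: {event["content"]}'}
            )
    return messages
```

Capping the history keeps requests small, which matters for both latency and the memory available on a microcontroller.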

Key config on device (`config.json`):
```json
{
  "llm_provider": "cerebras",
  "cerebras_api_key": "YOUR_KEY",
  "cerebras_model": "llama3.1-8b",
  "fallback_provider": "openrouter",
  "openrouter_api_key": "YOUR_KEY"
}
```
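
A config like this is worth validating before the first request, since a leftover `YOUR_KEY` placeholder would otherwise only surface as an authentication error. The sketch below is one way to do that check in Python; the function name `parse_config` and the required-key mapping are assumptions derived from the example file above, not a documented schema.

```python
import json

# Keys each provider needs, inferred from the example config above
REQUIRED_KEYS = {
    "cerebras": ("cerebras_api_key", "cerebras_model"),
    "openrouter": ("openrouter_api_key",),
}


def parse_config(text):
    """Parse config JSON and check that provider keys are filled in."""
    config = json.loads(text)
    primary = config.get("llm_provider")
    if primary is None:
        raise ValueError("llm_provider is required")
    providers = [primary]
    if config.get("fallback_provider"):
        providers.append(config["fallback_provider"])
    for provider in providers:
        if provider not in REQUIRED_KEYS:
            raise ValueError(f"unknown provider: {provider}")
        for key in REQUIRED_KEYS[provider]:
            value = config.get(key)
            if not value or value == "YOUR_KEY":
                raise ValueError(f"{key} is missing or still a placeholder")
    return config
```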
