Blog

Engineering Insights

Deep dives into AI systems, cloud architecture, distributed systems, and engineering leadership.

LLM Knowledge Distillation: Teacher-Student Architecture for Smaller, Cheaper Models
ai-architecture · 1 min read

Stop paying frontier prices for commodity work. Teacher-student distillation: methods, transfer-set design, the 30-100x cost math, and when not to do it.

July 2, 2026
How to Build an MCP Server: Tools, Resources, and Production Architecture
ai-architecture · 1 min read

A working MCP server fits in 100 lines. Production is the hard part: tool schema design, stdio vs Streamable HTTP, OAuth 2.1, output caps, and audit logging.

July 2, 2026
Claude Opus 4.8 vs Sonnet 5 vs Fable 5: Which Model for Which Task
ai-architecture · 1 min read

Opus 4.8, Sonnet 5, or Fable 5 — official pricing, positioning, and a task-fit decision framework, grounded in Anthropic's own docs, not contradictory third-party leaderboards.

July 1, 2026
TPU Inference Architecture: Serving LLMs on Trillium with vLLM
ai-architecture · 1 min read

GPUs are not the only serving option in 2026. TPU (Trillium) cost-per-token, the XLA compilation model, vLLM TPU backends, and agent-driven ops for self-hosted LLMs.

July 1, 2026
Local-First Architecture: CRDTs, Sync Engines, and Offline-First Apps for 2026
ai-architecture · 1 min read

The industry over-corrected toward routing everything through the cloud. Local-first architecture: CRDTs, sync engines, and why apps should work offline by default.

July 1, 2026
Deep Agents Architecture: Planning, Sub-Agents, and File-System Memory for Long-Horizon Tasks
ai-architecture · 1 min read

A simple tool-calling loop collapses on a 100-step task. Deep agents fix it with planning, sub-agents, and a file system as memory — the long-horizon agent pattern.

June 30, 2026
Prompt Caching Architecture for LLM Apps & Agents: Prefix Caching, Cost, and Latency
ai-architecture · 1 min read

Agents and RAG apps re-send the same long prefix every turn. Prompt caching cuts input cost by up to ~90% and shortens time to first token: the win most teams leave on the table.

June 30, 2026
A/B Testing and Online Experimentation for LLM Features
ai-architecture · 1 min read

A higher offline eval score is a hypothesis, not proof. How to run controlled online experiments on prompts, models, and RAG: architecture, metrics, and the statistics.

June 29, 2026
Vector Index Tuning for Production: HNSW, IVF, and Product Quantization
ai-architecture · 1 min read

The index parameters, not the database brand, decide whether RAG answers in 20ms at 95% recall or 200ms at 80%. Tuning HNSW, IVF, and Product Quantization in production.

June 29, 2026
LLM Quantization for Production Inference: INT8, FP8, AWQ, and GGUF
ai-architecture · 1 min read

GPUs dominate self-hosted inference cost. Quantization cuts memory 2-4x for a small accuracy hit: FP8, INT8, AWQ, GPTQ, GGUF, PTQ vs QAT, and when not to do it.

June 28, 2026
Document Chunking Architecture for RAG: Fixed, Semantic, Late, and Contextual Retrieval
ai-architecture · 1 min read

Chunking is the highest-leverage, most-neglected decision in RAG. Fixed vs recursive vs semantic vs late vs contextual retrieval — and the pipeline that ties them together.

June 28, 2026
Serverless AI Agent Runtime: microVM Lifecycle Architecture for Agent Workloads
ai-architecture · 1 min read

Agents are bursty, long-tailed, and untrusted — exactly what an always-on fleet handles worst. A serverless microVM runtime: scale-to-zero, isolation, and cold-start mitigation.

June 27, 2026
Managed vs Self-Hosted Code Sandboxes: A Build-vs-Buy Decision for AI Code Execution
ai-architecture · 1 min read

Should you buy a managed code sandbox or self-host Firecracker for AI code execution? A build-vs-buy decision framework across cost, compliance, and control.

June 27, 2026
Stateful AI Agent Sandbox Sessions: Pause, Resume & Snapshot with microVMs
ai-architecture · 1 min read

Long-running AI agents wait far more than they work. Stateful microVM sandboxes snapshot on idle and resume in milliseconds — full state kept, near-zero idle cost.

June 27, 2026
Data Lakehouse Architecture: Iceberg, Delta & the Medallion Pattern
ai-architecture · 1 min read

A lakehouse is a warehouse’s table semantics on a lake’s cheap storage, organised by the medallion pattern and kept alive by compaction and governance. How to architect one.

June 26, 2026
Zero-Downtime Database Migration Architecture: Expand-Contract, Dual-Write & Backfill
ai-architecture · 1 min read

Change a production schema with no downtime via expand-contract: add the new shape, dual-write, backfill in batches, verify, switch reads, drop the old. Every step reversible.

June 26, 2026
Architecting Physical AI Swarms: Edge Inference, Mesh Networking, and Coordinated Autonomy
ai-architecture · 1 min read

A physical AI swarm is a moving distributed system with a hostile network. The architecture for edge inference, masterless coordination, resilient mesh comms, and local safety.

June 25, 2026
Webhook Delivery Architecture: Retries, Idempotency, Signing & Ordering
cyber-security-patterns · 1 min read

Webhooks are a distributed-systems problem in an HTTP-POST costume. The producer architecture for at-least-once delivery: retries, dead-letter, HMAC signing, idempotency, ordering.

June 25, 2026
Vector Database Architecture: Choosing and Scaling pgvector, Pinecone, Qdrant & Weaviate
ai-architecture · 1 min read

A vector database is a recall/latency/memory machine behind a similarity-search API. How to choose pgvector, Pinecone, Qdrant, or Weaviate — and what breaks first as it scales.

June 25, 2026
Fine-Tuning vs RAG vs Prompt Engineering: A Decision Architecture
ai-architecture · 1 min read

Fine-tuning, RAG, and prompt engineering answer three different questions — how should it behave, what should it know, what should it be told. A decision architecture.

June 25, 2026

Stay on the Cutting Edge

Weekly deep dives on AI systems, cloud architecture, distributed systems, and engineering leadership. Join 5,000+ engineers.