
NNO Provisioning Service

Documentation for NNO Provisioning Service

Date: 2026-03-30 · Status: Detailed Design · Parent: System Architecture · Service: services/provisioning


Overview

The NNO Provisioning Service is responsible for creating, configuring, and deleting Cloudflare infrastructure resources on behalf of client platforms. It is the only NNO service that calls the Cloudflare API directly.

NNO follows a lazy, on-demand provisioning model: platforms receive a minimal resource footprint on signup (auth Worker + registry record + billing customer), and additional Cloudflare resources (Pages projects, D1 databases, R2 buckets, KV namespaces) are created only when a feature that requires them is activated. This mirrors the model of Supabase or Firebase — a project exists instantly, and capabilities are enabled as needed.

Every provisioning operation is:

  • Queue-driven — POST /provision/* creates a job record (PENDING), enqueues it to the Cloudflare Queue, and returns 202 Accepted. The queue consumer executes the job asynchronously, making real Cloudflare API calls. Callers poll GET /provision/jobs/:jobId to track progress.
  • Tracked — recorded as a provision_job in the Provisioning D1 before any steps run
  • Idempotent — steps check whether the target resource already exists before calling the CF API; already-created resources are detected and skipped
  • Transactional in intent (Phase 2) — failed jobs will trigger rollback of all completed steps
  • Audited — every state change is written to structured logs

Phase 1 vs Phase 2: Phase 1 queue-based async execution is live — real Cloudflare API calls are made for every implemented job type. The detailed designs in §1.2–§1.4 and §6 (Rollback) describe Phase 2 targets (typed CF client class, rollback traversal). Phase 2 will also add DLQ alerting and queue event wiring from IAM/Registry (§7).

┌─── Trigger paths ────────────────────────────────────────────────┐
│                                                                   │
│  1. Registration (automatic)                                      │
│     services/iam sign-up ──► platform.registered (CF Queue)      │
│                                                                   │
│  2. Feature activation (self-service)                             │
│     services/registry PATCH /features ──► feature.activating     │
│                              (CF Queue)                           │
│                                                                   │
│  3. Operator override (manual)                                    │
│     NNO Zero UI ──► services/gateway ──► POST /provision/*       │
│                                                                   │
└──────────────────────────────┬────────────────────────────────────┘


                  NNO Provisioning Service
                  (CF Queue consumer)

               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
         Cloudflare API   NNO Registry    services/billing
         (CF Workers,     (platform,      (Stripe customer
          D1, R2, KV,      resources,      creation)
          Pages)           feature status)

0.5 Phase 1 Implementation Detail [Phase 1]

Implementation flow detail: Provisioning Phase 1 Plan.

Implemented Job Types

All ten job types listed below are wired and make real CF API calls:

  • BOOTSTRAP_PLATFORM (executors/provision-platform.ts): D1 create, Worker deploy, secrets, migrations, Pages project + build
  • ACTIVATE_FEATURE (executors/activate-feature.ts): D1 create (conditional), Worker deploy (conditional), secrets, migrations
  • DEACTIVATE_FEATURE (executors/deactivate-feature.ts): Worker delete only — D1 is never deleted
  • PROVISION_STACK (executors/provision-stack.ts): shared resources + per-feature sub-jobs
  • DEACTIVATE_STACK (executors/deactivate-stack.ts): enqueues DEACTIVATE_FEATURE sub-jobs; deletes shared resources only if deleteData: true
  • DEPROVISION_PLATFORM (executors/deprovision-platform.ts): full teardown
  • ONBOARD_PLATFORM (executors/onboard-platform.ts): no direct CF API calls — orchestrates Registry + Billing and enqueues BOOTSTRAP_PLATFORM
  • UPGRADE_AUTH_WORKER (executors/upgrade-auth-worker.ts): fetches the latest auth bundle from NNO_AUTH_BUNDLE_URL, re-deploys the auth Worker (preserving the D1 binding), refreshes the CORS_ORIGINS secret
  • CREATE_APP (executors/create-app.ts): creates a CF Pages project and/or stub Worker for a new app/service within an existing workspace stack; optionally creates a D1 database; registers DNS hostnames and resources in Registry
  • ADD_CUSTOM_DOMAIN (executors/add-custom-domain.ts): adds a CF4SaaS custom hostname to an existing DNS-registered resource; requires CF_ZONE_ID; registers the custom domain record in Registry

ONBOARD_PLATFORM (Implemented) — outer onboarding job that orchestrates pre-provisioning steps before triggering platform resource creation. Steps: (1) create platform + entity records in Registry, (2) create Stripe customer + subscription in Billing, (3) enqueue BOOTSTRAP_PLATFORM job for CF resource creation, (4) update onboarding_sessions checklist as each step completes. Triggered by the self-serve onboarding endpoint in Registry. Wraps but does not replace BOOTSTRAP_PLATFORM.



0. Provisioning Triggers [Phase 1]

Provisioning is initiated by three paths, all converging on the same job queue:

0.1 Registration Trigger (Automatic)

When a new client registers, services/iam emits a platform.registered event to a Cloudflare Queue. The provisioning consumer picks this up and runs BOOTSTRAP_PLATFORM — creating only the minimum viable resources needed before the client can log in:

Client sign-up


services/iam
  ├── Create user record in IAM D1
  ├── Create platform record in Registry
  └── Enqueue: platform.registered → CF Queue


         services/provisioning (queue consumer)
           ├── Deploy auth Worker        [always]
           ├── Create auth D1 + migrate  [always]
           └── Create Stripe customer    [always]

No CF Pages project, no feature D1 databases, no R2 buckets — those are created on-demand when features are activated.

0.2 Feature Activation Trigger (Self-Service)

When a client activates a feature from their console, services/registry updates the feature record to activating and emits a feature.activating event. The provisioning consumer reads the feature's FeatureManifest.resources declaration and creates only the resources that feature requires:

Client activates "Analytics" feature


services/registry
  ├── Quota check via services/billing   ← plan gate
  ├── PATCH feature_activation status → 'activating'
  └── Enqueue: feature.activating → CF Queue


         services/provisioning (queue consumer)
           ├── Read FeatureManifest.resources
           │   { d1: true, worker: true, minimumPlan: 'growth' }
           ├── Create analytics D1 + run migrations
           ├── Deploy analytics Worker
           └── PATCH feature_activation status → 'active'

See Section 2.2 for the full ACTIVATE_FEATURE step table.
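As a sketch of the conditional creation logic, the consumer can derive an ordered step plan from the manifest's resource flags. This is illustrative only — the interface mirrors FeatureResourceRequirements (§2.2), but resourcePlan is a hypothetical helper, not the executor's real code:

```typescript
// Illustrative: derive an ordered creation plan from a feature's declared
// resource requirements. Field names mirror FeatureResourceRequirements (§2.2);
// the resourcePlan helper itself is hypothetical.
interface FeatureResourceRequirements {
  worker?: boolean;
  pages?: boolean;
  d1?: boolean;
  r2?: boolean;
  kv?: boolean;
  queue?: boolean;
  minimumPlan?: "starter" | "growth" | "scale";
}

function resourcePlan(r: FeatureResourceRequirements): string[] {
  const plan: string[] = [];
  if (r.d1) plan.push("create_d1", "run_migrations");
  if (r.r2) plan.push("create_r2");
  if (r.kv) plan.push("create_kv");
  if (r.worker) plan.push("deploy_worker", "set_secrets");
  if (r.pages) plan.push("create_pages_project");
  return plan;
}
```

For the Analytics example above ({ d1: true, worker: true }), the plan is create_d1, run_migrations, deploy_worker, set_secrets — nothing else is touched.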

0.3 Operator Override (Manual)

The NNO operator can provision, deprovision, or force-activate resources for any platform via the operator portal or directly via the provisioning API. Like all other trigger paths, this creates a provision_job record (PENDING) and enqueues it to the Cloudflare Queue — the same queue consumer (handleQueueBatch) executes it asynchronously. The HTTP endpoint returns 202 Accepted immediately; callers poll GET /provision/jobs/:jobId for progress.

The operator plane is a management/override layer — not the primary provisioning path for normal client onboarding.

0.4 Stack Activation Trigger (Self-Service)

When a platform admin activates a stack template (or creates a platform-local stack) from the portal or via CLI, services/registry validates the StackDefinition and emits a stack.activating event:

Platform admin activates "[email protected]" from portal UI


services/registry
  ├── Fetch StackDefinition from services/stack-registry (for template-based stacks)
  │   OR use inline definition (for platform-local stacks)
  ├── Validate all featureIds exist in feature catalogue
  ├── Quota check via services/billing (minimumPlan gate)
  ├── Create stacks record (status: PENDING)
  └── Enqueue: stack.activating → CF Queue


         services/provisioning (queue consumer)
           ├── Fetch StackDefinition from stack-registry (step 1)
           ├── Provision shared resources (D1, R2, KV as declared)
           └── Activate each feature with shared resource bindings

Returns 202 { stackInstanceId, jobId }. The job is tracked as type PROVISION_STACK.
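The three queue-driven trigger events (§0.1, §0.2, §0.4) converge on the same consumer, which maps each event to a job type. A minimal sketch — the event and job-type names come from this document, but the payload shapes and the dispatch function are illustrative:

```typescript
// Illustrative dispatch: queue event → provision job type. Event names
// (platform.registered, feature.activating, stack.activating) and job types
// are from this document; the payload shapes are assumptions.
type QueueEvent =
  | { type: "platform.registered"; platformId: string }
  | { type: "feature.activating"; platformId: string; featureId: string }
  | { type: "stack.activating"; platformId: string; stackInstanceId: string };

function jobTypeFor(event: QueueEvent): string {
  switch (event.type) {
    case "platform.registered":
      return "BOOTSTRAP_PLATFORM"; // §0.1 — minimal bootstrap
    case "feature.activating":
      return "ACTIVATE_FEATURE"; // §0.2 — manifest-driven resources
    case "stack.activating":
      return "PROVISION_STACK"; // §0.4 — shared resources + sub-jobs
  }
}
```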


1. Cloudflare API Integration [Phase 1]

1.1 API Credentials

In Phase 1 (NNO full control), the Provisioning Service uses a single set of credentials stored as Wrangler secrets:

Secret         Description
CF_API_TOKEN   Cloudflare API token with scoped permissions (see below)
CF_ACCOUNT_ID  Cloudflare account ID
CF_ZONE_ID     Cloudflare zone ID — required by the ADD_CUSTOM_DOMAIN executor for CF4SaaS custom hostname registration

Required token permissions (scoped, not global):

Permission               Scope    Why
Workers Scripts:Edit     Account  Deploy and delete Workers
Workers Routes:Edit      Zone     Manage Worker routes
Cloudflare Pages:Edit    Account  Create and manage Pages projects
D1:Edit                  Account  Create and delete D1 databases
Workers R2 Storage:Edit  Account  Create and delete R2 buckets
Workers KV Storage:Edit  Account  Create and delete KV namespaces
Queues:Edit              Account  Create and delete Queues
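Every executor call ultimately hits the public Cloudflare v4 REST API with these credentials. A hedged sketch of one such call (D1 database creation) — the endpoint and Bearer-token header are the standard Cloudflare API surface, but the helper names and error handling are illustrative, not the Phase 2 CloudflareClient design:

```typescript
// Illustrative sketch of a scoped Cloudflare API call using CF_API_TOKEN /
// CF_ACCOUNT_ID. The endpoint is the public CF v4 API; helpers are hypothetical.
const CF_API = "https://api.cloudflare.com/client/v4";

function cfHeaders(apiToken: string): Record<string, string> {
  return {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  };
}

async function createD1Database(
  env: { CF_API_TOKEN: string; CF_ACCOUNT_ID: string },
  name: string,
): Promise<string> {
  const res = await fetch(`${CF_API}/accounts/${env.CF_ACCOUNT_ID}/d1/database`, {
    method: "POST",
    headers: cfHeaders(env.CF_API_TOKEN),
    body: JSON.stringify({ name }),
  });
  if (!res.ok) throw new Error(`CF API error: ${res.status}`);
  const body = (await res.json()) as { result: { uuid: string } };
  return body.result.uuid; // D1 database ID
}
```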

1.2–1.4 CF API Client, Resource Operations, Rate Limits — Phase 2 Design

Phase 2 design. See Provisioning Phase 2 Plan.


2. Provisioning Operations [Phase 1]

2.1 BOOTSTRAP_PLATFORM (formerly PROVISION_PLATFORM)

Triggered automatically on client registration via the platform.registered queue event (§0.1), or manually by the operator via the Zero UI. Creates the minimum viable resource set only — additional resources are provisioned lazily when features are activated (§2.2).

Input:

{
  platformId: string;
  planTier: string;              // 'starter' | 'growth' | 'scale'
  billingEmail: string;          // billing contact email
  defaultEntityId?: string;      // optional pre-generated tenant ID for the default tenant
  environment?: 'dev' | 'stg' | 'prod';  // default: 'prod'
}

Steps (in order) — minimal bootstrap only:

[Updated for DNS architecture] Resource names now use stack-id (default) instead of entity-id. DNS hostnames are registered via CF4SaaS at provisioning time.

  1. Create {pid}-default-auth-db D1 (staging: {pid}-default-auth-db-stg). Rollback: delete the D1.
  2. Deploy {pid}-default-auth Worker (staging: {pid}-default-auth-stg). Rollback: delete the Worker.
  3. Set secrets on the auth Worker. Rollback: none (secrets are deleted with the Worker).
  4. Run auth D1 migrations. Rollback: none (the migration is idempotent).
  5. Create the Stripe customer record. Rollback: cancel the Stripe customer.
  6. Register platform + auth resources in Registry. Rollback: mark as deleted.
  7. Register DNS hostname auth.svc.default.{pid}.nno.app via CF4SaaS. Rollback: delete the CF4SaaS custom hostname.

Not included in bootstrap: CF Pages project, feature D1 databases, R2 buckets, KV namespaces — these are created on-demand when the relevant feature is activated (see §2.2). A client's platform is considered "live" (able to log in) as soon as the auth Worker and D1 are up.

2.2 ACTIVATE_FEATURE

Triggered when a client activates a feature from their console (via the feature.activating queue event, §0.2) or when the operator force-activates a feature from the Zero UI. Creates only the Cloudflare resources declared in the feature's FeatureManifest.resources block — nothing more.

Input:

{
  platformId: string;
  entityId: string; // required — ID of the entity (tenant/sub-tenant) activating the feature
  featureId: string; // e.g. 'analytics'
  featureVersion: string; // e.g. '1.2.0'
  environment: "dev" | "stg" | "prod";
  resources: FeatureResourceRequirements; // read from FeatureManifest.resources at activation time
}

FeatureResourceRequirements (declared in each feature's FeatureManifest):

interface FeatureResourceRequirements {
  worker?: boolean; // Deploy a CF Worker for this feature's backend API
  pages?: boolean; // Create a CF Pages project (SPA portal)
  d1?: boolean; // Create a D1 SQLite database
  r2?: boolean; // Create an R2 object storage bucket
  kv?: boolean; // Create a KV namespace
  queue?: boolean; // Create a Cloudflare Queue
  minimumPlan?: "starter" | "growth" | "scale"; // Plan gate — checked before job is queued
}

Steps (conditioned on resources.*):

  1. Always: quota check via billing service.
  2. Always: mark feature_activation as 'activating'. Rollback: mark as 'failed'.
  3. resources.d1: create feature D1 database. Rollback: delete the D1.
  4. resources.d1: register D1 in Registry. Rollback: mark deleted.
  5. resources.d1: run D1 migrations. Rollback: none (idempotent).
  6. resources.r2: create R2 bucket. Rollback: delete the bucket.
  7. resources.r2: register R2 in Registry. Rollback: mark deleted.
  8. resources.kv: create KV namespace. Rollback: delete the namespace.
  9. resources.kv: register KV in Registry. Rollback: mark deleted.
  10. resources.worker: deploy feature Worker. Rollback: delete the Worker.
  11. resources.worker: set secrets on the Worker. Rollback: none (deleted with the Worker).
  12. resources.worker: register Worker in Registry. Rollback: mark deleted.
  13. resources.pages: create CF Pages project. Rollback: delete the Pages project.
  14. resources.pages: connect to the platform GitHub repo. Rollback: disconnect.
  15. resources.pages: trigger initial Pages build. (The build can fail independently.)
  16. Always: update feature_activation status to 'active'. Rollback: mark as 'failed'.
  17. Always: trigger platform shell rebuild via CLI Service. (The rebuild can fail independently.)

Plan enforcement: Before the job is enqueued, services/registry calls GET /api/v1/billing/quota/check?platformId=&resource= to verify the platform's plan permits the requested resources. A 402 response halts the activation with an upgrade prompt — no job is created.

2.3 DEACTIVATE_FEATURE

Gracefully removes a feature's resources. Data is preserved (D1 not deleted) by default.

Input:

{
  platformId: string;
  entityId: string;                         // required — ID of the entity whose feature resources are being removed
  featureId: string;
  environment: string;
  deleteData?: boolean;      // default: false — preserve D1 data
}

Steps:

  1. Mark feature_activation as 'deactivating'
  2. Delete feature Worker (stops serving traffic)
  3. Mark Worker resource as deleted in Registry
  4. If deleteData: true, delete the D1 database
  5. Mark D1 resource as deleted in Registry (or 'archived' if data is preserved)
  6. Mark feature_activation as 'inactive'
  7. Trigger platform shell rebuild via CLI Service

2.3a PROVISION_STACK

Triggered when a platform admin activates a stack template (via the stack.activating queue event, §0.4). Provisions shared CF resources first, then activates each feature in the stack with shared resource bindings injected.

Input:

{
  platformId: string;
  stackInstanceId: string; // pre-created stacks record
  stackDefinition: StackDefinition; // resolved from stack-registry or inline (local)
  environment: "dev" | "stg" | "prod";
}

Steps (in order):

  1. resolve_stack_definition (always): fetch the StackDefinition from stack-registry (template) or use the inline definition (local); validate that all featureIds exist.
  2. create_shared_d1 (resources.sharedD1): create D1 {platformId}-{stackId}-db-{env}; idempotent. Rollback: delete the D1.
  3. register_shared_d1 (resources.sharedD1): register the D1 in Registry as resource_type: 'd1', service_name: '{stackId}'. Rollback: mark deleted.
  4. create_shared_r2 (resources.sharedR2): create R2 bucket {platformId}-{stackId}-storage-{env}; idempotent. Rollback: delete the bucket.
  5. register_shared_r2 (resources.sharedR2): register the R2 in Registry. Rollback: mark deleted.
  6. create_shared_kv (resources.sharedKV): create KV namespace {platformId}-{stackId}-kv-{env}; idempotent. Rollback: delete the namespace.
  7. register_shared_kv (resources.sharedKV): register the KV in Registry. Rollback: mark deleted.
  8. deploy_stack_worker (resources.worker): deploy Worker {platformId}-{stackId}-orchestrator-{env}, binding STACK_DB, STACK_STORAGE, STACK_KV as available. Rollback: delete the Worker.
  9. activate_features (always): for each feature in features[] (in order), run an ACTIVATE_FEATURE sub-job, injecting shared bindings: STACK_DB → sharedD1.cfId, STACK_STORAGE → sharedR2.bucketName, STACK_KV → sharedKv.namespaceId, STACK_ID → stackInstanceId. A required feature's failure aborts the entire PROVISION_STACK job; an optional feature's failure logs a warning and continues. Rollback: per-feature rollback.
  10. register_dns_hostnames (non-fatal): register CF4SaaS custom hostnames for each app/service in the stack: {name}.{type}.{stackId}.{platformId}.nno.app (staging: {name}.{type}.stg.{stackId}.{platformId}.nno.app); record each hostname in the Registry dns_records table. Failure does not abort the job. Rollback: delete the CF4SaaS hostnames.
  11. register_stack_instance (non-fatal): PATCH /platforms/{platformId}/stacks/{stackInstanceId} → status: active, sharedResources: { d1Id, r2Name, kvId, workerName }. Rollback: mark FAILED.
  12. trigger_shell_rebuild (non-fatal): POST to the CLI Service with stackId context. Failure does not abort the job.

Shared resource bindings injected into each feature Worker:

// Bindings set on each feature Worker in the stack (at activation time)
{
  STACK_DB:      sharedD1?.cfId ?? undefined,
  STACK_STORAGE: sharedR2?.bucketName ?? undefined,
  STACK_KV:      sharedKv?.namespaceId ?? undefined,
  STACK_ID:      stackInstanceId,
}

Required vs optional feature failure:

  • required: true → any ACTIVATE_FEATURE sub-job failure transitions PROVISION_STACK to FAILED and triggers rollback of all completed steps
  • required: false → sub-job failure is logged as a warning; the step is marked SKIPPED; the stack instance status becomes DEGRADED instead of ACTIVE

Stack naming convention for shared resources:

D1:     {platformId}-{stackId}-db-{env}              e.g. k3m9p2xw7q-saas-starter-db-prod
R2:     {platformId}-{stackId}-storage-{env}         e.g. k3m9p2xw7q-saas-starter-storage-prod
KV:     {platformId}-{stackId}-kv-{env}              e.g. k3m9p2xw7q-saas-starter-kv-prod
Worker: {platformId}-{stackId}-orchestrator-{env}    e.g. k3m9p2xw7q-saas-starter-orchestrator-prod
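The convention above can be captured in a single helper (illustrative — the real executors may compose names differently):

```typescript
// Illustrative: shared stack resource naming, per the convention above.
type SharedKind = "db" | "storage" | "kv" | "orchestrator";
type Env = "dev" | "stg" | "prod";

function sharedResourceName(
  platformId: string,
  stackId: string,
  kind: SharedKind,
  env: Env,
): string {
  return `${platformId}-${stackId}-${kind}-${env}`;
}
```

For example, sharedResourceName("k3m9p2xw7q", "saas-starter", "db", "prod") yields k3m9p2xw7q-saas-starter-db-prod.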

See stacks.md for the full Stack architecture including shared resource prefix conventions.

2.3b Custom Domain Provisioning

When a client maps a custom domain to a platform resource, the provisioning service handles CF4SaaS registration:

Client submits custom domain (e.g. app.acmecorp.com) for dashboard.app.x7y8z9w0q1.a1b2c3d4e5.nno.app


Registry: create dns_records row (status: pending_validation)


Provisioning: POST CF4SaaS custom hostname API
  → CF returns validation TXT record details


Registry: update dns_records row with validation details
Client: add CNAME + TXT records in their DNS registrar


Provisioning: poll CF4SaaS until SSL status = active (or timeout)
  → CF validates ownership and provisions TLS certificate automatically


Registry: update dns_records row (status: active)

Custom hostname records are stored in the Registry dns_records table. See dns-naming.md for the full CF4SaaS flow and registry.md for the dns_records schema.

2.4 DEPROVISION_PLATFORM

Full teardown of a platform. Requires explicit confirmation — irreversible.

Steps: Reverse of BOOTSTRAP_PLATFORM plus all activated feature resources, in reverse dependency order.


3. State Machine [Phase 1]

Every provision_job follows this state machine:

                     ┌──────────┐
                     │  PENDING │  ← job created in Registry
                     └────┬─────┘
                          │ worker picks up job
                     ┌────▼─────┐
                     │ RUNNING  │  ← steps executing
                     └────┬─────┘
              ┌───────────┼───────────┐
              │           │           │
         ┌────▼────┐  ┌───▼───┐  ┌───▼──────┐
         │COMPLETED│  │FAILED │  │ TIMED_OUT│
         └─────────┘  └───┬───┘  └────┬─────┘
                           │           │
                    ┌──────▼───────────▼───────┐
                    │ retry_count < max_retries│
                    └──────┬───────────┬───────┘
                           │ yes       │ no
                     ┌─────▼─────┐ ┌───▼─────────┐
                     │  PENDING  │ │ ROLLING_BACK│
                     │(re-queued)│ └───┬─────────┘
                     └───────────┘     │
                                   ┌───▼─────────┐
                                   │ ROLLED_BACK │
                                   └─────────────┘
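Expressed as a transition table (an illustrative encoding of the diagram above, not code from the service):

```typescript
// Illustrative: the provision_job state machine as a transition table.
type JobStatus =
  | "PENDING" | "RUNNING" | "COMPLETED" | "FAILED"
  | "TIMED_OUT" | "ROLLING_BACK" | "ROLLED_BACK";

const TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  PENDING: ["RUNNING"],                          // worker picks up the job
  RUNNING: ["COMPLETED", "FAILED", "TIMED_OUT"], // steps executing
  FAILED: ["PENDING", "ROLLING_BACK"],           // retry vs exhausted retries
  TIMED_OUT: ["PENDING", "ROLLING_BACK"],
  ROLLING_BACK: ["ROLLED_BACK"],
  COMPLETED: [],                                 // terminal
  ROLLED_BACK: [],                               // terminal
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```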

Step Tracking

Each completed step is recorded in provision_jobs.steps as a JSON array. This enables precise rollback — only steps that completed successfully need to be reversed:

// provision_jobs.steps (after partial completion)
[
  {
    "step": 1,
    "action": "create_d1",
    "status": "completed",
    "output": { "cf_id": "fa098e4d-...", "resource_id": "res_abc123" },
    "completedAt": 1740000001000
  },
  {
    "step": 2,
    "action": "deploy_worker",
    "status": "completed",
    "output": {
      "script_name": "k3m9p2xw7q-r8n4t6y1z5-auth-prod",
      "resource_id": "res_def456"
    },
    "completedAt": 1740000008000
  },
  {
    "step": 3,
    "action": "set_secrets",
    "status": "failed",
    "error": "CF API 429: rate limited",
    "failedAt": 1740000012000
  }
]
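Because step status is recorded per step, the Phase 2 rollback engine only needs to reverse the completed entries, newest first. A sketch (the record shape mirrors the JSON above; the helper itself is illustrative):

```typescript
// Illustrative: select the steps a rollback would reverse — completed steps
// only, in reverse order. Shape mirrors provision_jobs.steps above.
interface StepRecord {
  step: number;
  action: string;
  status: "completed" | "failed" | "pending";
}

function stepsToRollback(steps: StepRecord[]): StepRecord[] {
  return steps.filter((s) => s.status === "completed").reverse();
}
```

For the partial job above, rollback would reverse deploy_worker first, then create_d1; the failed set_secrets step has nothing to undo.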

4. Retry & Backoff [Phase 1]

Retry Strategy

async function withRetry<T>(
  fn: () => Promise<T>,
  job: ProvisionJob,
  step: number,
): Promise<T> {
  const maxRetries = 3;
  let attempt = 0;

  while (attempt <= maxRetries) {
    try {
      return await fn();
    } catch (err) {
      if (err instanceof RateLimitError) {
        // Honour Cloudflare's Retry-After header
        await sleep(err.retryAfterSeconds * 1000);
        attempt++;
        continue;
      }

      if (err instanceof CloudflareApiError && err.isTransient()) {
        // 500, 502, 503, 504 — exponential backoff
        const delay = Math.min(1000 * Math.pow(2, attempt), 30_000);
        await sleep(delay);
        attempt++;
        continue;
      }

      // Non-transient error (400, 409, etc.) — fail immediately
      throw err;
    }
  }

  throw new MaxRetriesExceededError(
    `Step ${step} failed after ${maxRetries} retries`,
  );
}

Transient vs Non-Transient Errors

CF API status          Type                             Behaviour
429                    Transient                        Retry after the Retry-After header value
500 / 502 / 503 / 504  Transient                        Retry with exponential backoff (1s, 2s, 4s)
400                    Non-transient                    Fail immediately — request is malformed
403                    Non-transient                    Fail immediately — permissions misconfigured
409                    Non-transient (idempotency hit)  Resource already exists — treat as success
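The table reduces to a small classifier plus the capped backoff formula used in withRetry (an illustrative restatement — the executor's real error types differ):

```typescript
// Illustrative: CF API status → retry disposition, per the table above.
type Disposition =
  | "retry_after_header" // 429 — honour Retry-After
  | "retry_backoff"      // 5xx — exponential backoff
  | "treat_as_success"   // 409 — idempotency hit
  | "fail";              // 400, 403, everything else

function classifyCfStatus(status: number): Disposition {
  if (status === 429) return "retry_after_header";
  if ([500, 502, 503, 504].includes(status)) return "retry_backoff";
  if (status === 409) return "treat_as_success";
  return "fail";
}

// Backoff from withRetry: 1s, 2s, 4s, ... capped at 30s.
function backoffMs(attempt: number): number {
  return Math.min(1000 * 2 ** attempt, 30_000);
}
```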

5. Idempotency [Phase 1]

Every create operation checks whether the resource already exists before calling the CF API. This makes jobs safe to re-run after partial failure:

async function ensureD1Exists(
  cf: CloudflareClient,
  name: string,
): Promise<{ cfId: string; wasCreated: boolean }> {
  // Check Registry first (fastest path)
  const existing = await registry.resources.lookupByCfName(name);
  if (existing?.status === "active") {
    return { cfId: existing.cfId, wasCreated: false };
  }

  // Check CF API (in case Registry is stale)
  const cfExisting = await cf.d1.findByName(name);
  if (cfExisting) {
    return { cfId: cfExisting.uuid, wasCreated: false };
  }

  // Create
  const created = await cf.d1.create({ name });
  return { cfId: created.uuid, wasCreated: true };
}

The 409 Conflict response from the CF API is treated as a success (not an error) for create operations.


6. Rollback [Phase 2]

Phase 2 design. See Provisioning Phase 2 Plan.

Phase 2 also introduces the typed CloudflareClient class (services/provisioning/src/cf-client/index.ts) and the rollback engine (reverse-iterate completed steps, call CF delete APIs). See docs/implementation/phase-2/provisioning.md for full design.


7. Job Worker (Queue Consumer) [Phase 1]

Phase 1 (current): The Cloudflare Queue consumer (executors/queue-consumer.ts) and PROVISION_QUEUE binding are already live. Jobs are enqueued on POST /provision/* and processed asynchronously by the queue consumer.

Phase 2 design (queue event wiring from IAM/Registry). See Provisioning Phase 2 Plan.

7.1 Dead Letter Queue (DLQ) [Phase 2]

When a provisioning job exceeds its retry budget (3 attempts by default), the Cloudflare Queue moves the message to the Dead Letter Queue for operator review and alerting.

Property       Value
Queue name     nno-k3m9p2xw7q-provisioning-dlq-{env}
KV namespace   NNO_PROVISIONING_DLQ_KV — stores DLQ message metadata for operator queries
Alert trigger  DLQ depth > 0 triggers an operator alert (email/Slack)

Operator endpoints (Phase 2, mounted in services/provisioning):

GET  /api/v1/provision/dlq                  List DLQ messages (paginated)
GET  /api/v1/provision/dlq/:messageId       Get DLQ message details + failed steps
POST /api/v1/provision/dlq/:messageId/retry Re-enqueue a DLQ message for retry
POST /api/v1/provision/dlq/:messageId/dismiss Mark as resolved without retry

DLQ messages retain the original job payload and the steps array showing which step caused the final failure. Operators can inspect the error, fix the underlying issue (e.g., CF API permission), and retry without recreating the provisioning request.


8. Provisioning API [Phase 1]

The Provisioning Service exposes an internal API consumed by the NNO Gateway (not exposed to clients directly):

POST   /api/v1/provision/platform             Trigger BOOTSTRAP_PLATFORM
POST   /api/v1/provision/feature/activate     Trigger ACTIVATE_FEATURE
POST   /api/v1/provision/feature/deactivate   Trigger DEACTIVATE_FEATURE
POST   /api/v1/provision/stack/activate       Trigger PROVISION_STACK
POST   /api/v1/provision/stack/deactivate     Trigger DEACTIVATE_STACK
POST   /api/v1/provision/platform/deprovision Trigger DEPROVISION_PLATFORM (requires confirmation token)
GET    /api/v1/provision/jobs/:jobId          Get job status + step details
GET    /api/v1/provision/jobs                 List jobs for a platform (query: ?platformId=&limit=25&cursor=)

All routes are mounted at /api/v1/provision in services/provisioning/src/index.ts. The service is internal and accessed via the NNO Gateway only.

Pagination: GET /api/v1/provision/jobs uses cursor-based pagination consistent with the NNO Registry pagination standard. Response envelope: { "data": [...], "pagination": { "hasMore": bool, "nextCursor": string | null } }. Provisioning jobs are append-only and never deleted, so cursor and offset are semantically equivalent for this dataset — cursor is used for consistency across the platform.
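Draining the paginated list looks like this (a sketch — the page fetcher is injected and shown synchronous for brevity; the real call is an authenticated GET through the Gateway):

```typescript
// Illustrative: drain a cursor-paginated jobs listing using the documented
// envelope. fetchPage stands in for GET /api/v1/provision/jobs?cursor=...
interface Page<T> {
  data: T[];
  pagination: { hasMore: boolean; nextCursor: string | null };
}

function listAll<T>(fetchPage: (cursor: string | null) => Page<T>): T[] {
  const out: T[] = [];
  let cursor: string | null = null;
  for (;;) {
    const page = fetchPage(cursor);
    out.push(...page.data);
    if (!page.pagination.hasMore || page.pagination.nextCursor === null) break;
    cursor = page.pagination.nextCursor;
  }
  return out;
}
```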

POST /api/v1/provision/feature/activate request:

{
  "platformId": "k3m9p2xw7q",
  "entityId": "r8n4t6y1z5",
  "featureId": "analytics",
  "featureVersion": "1.2.0",
  "environment": "prod"
}

All POST /api/v1/provision/* responses:

202 Accepted
{ "jobId": "job_n3r8t5w2y6", "status": "PENDING" }

The job is enqueued and processed asynchronously by the Cloudflare Queue consumer. The HTTP response always returns PENDING; use GET /api/v1/provision/jobs/:jobId to poll for the actual job status and step results.

Schema note: The provisioning_jobs table has an environment TEXT column added by migrations/0002_add_environment.sql. This stores the target environment ('dev' | 'stg' | 'prod') for each job. Note: the provisioning service has its own provisioning_jobs table (in its own D1) which is separate from the Registry's provision_jobs table — they track different aspects of the same operation. The provisioning service uses uppercase type values (e.g., BOOTSTRAP_PLATFORM, ACTIVATE_FEATURE). The Registry uses lowercase operation values (e.g., provision_platform, activate_feature) in a different field.

GET /api/v1/provision/jobs/:jobId response:

{
  "id": "job_n3r8t5w2y6",
  "type": "ACTIVATE_FEATURE",
  "status": "RUNNING",
  "platformId": "k3m9p2xw7q",
  "entityId": "r8n4t6y1z5",
  "featureId": "analytics",
  "environment": "prod",
  "steps": [
    {
      "name": "create_d1",
      "status": "COMPLETED",
      "result": { "cfId": "fa098e4d-...", "message": "D1 database created" },
      "startedAt": 1740230460000,
      "completedAt": 1740230462000
    },
    {
      "name": "deploy_worker",
      "status": "RUNNING",
      "result": null,
      "startedAt": 1740230465000,
      "completedAt": null
    }
  ],
  "error": null,
  "createdAt": 1740230460000,
  "startedAt": 1740230461000,
  "completedAt": null
}

9. Observability [Phase 1]

All provisioning operations emit structured logs to Cloudflare Logpush:

// Every step logs at start and end
console.log(
  JSON.stringify({
    level: "info",
    event: "provision_step",
    jobId: job.id,
    step: step.number,
    action: step.action,
    status, // one of: "started" | "completed" | "failed"
    durationMs: elapsed,
    cfResource: cfName,
    error: err?.message ?? null,
  }),
);

Key metrics to alert on:

  • Job failure rate > 5% over 1 hour
  • Job duration > 120 seconds (P95)
  • DLQ depth > 0 (any job hitting DLQ)
  • CF API error rate > 1% (watch for permission or quota issues)
  • Workers daily deploy count approaching 200 (CF account limit)

10. Data Retention Policy [Phase 1]

This section documents how Cloudflare resources and Registry records are treated when a feature or stack is deactivated. All policies here reflect Phase 1 actual behaviour (grounded in the executor source code). Time-based hot/cold/permanent deletion tiers are not implemented in Phase 1.

Feature Deactivation (DEACTIVATE_FEATURE)

The executeDeactivateFeature executor (executors/deactivate-feature.ts) only deletes the Worker. The D1 database is never deleted, regardless of request parameters:

DEACTIVATE_FEATURE job
  ├─ Step 1: Mark feature → status: "deactivating" (Registry PATCH)
  ├─ Step 2: Delete feature Worker via cf.workers.delete()  ← only CF resource removed
  ├─ Step 3: Mark feature → status: "inactive" (Registry PATCH)
  └─ Step 4: Trigger shell rebuild (CLI Service)

deleteData flag (Phase 2): The POST /feature/deactivate route accepts deleteData: boolean in the request body, and the DeactivateFeatureSchema validates it. However, the DEACTIVATE_FEATURE executor does not act on this flag in Phase 1 — D1 data is always preserved. deleteData support for standalone features is a Phase 2 addition.

Post-deactivation state:

Resource                                Status after deactivation
Feature Worker                          Deleted from Cloudflare
Feature D1 database                     Preserved — still exists in the Cloudflare account
D1 data                                 Preserved — accessible via the CF Dashboard or direct D1 API
Registry feature_activation record      Updated to status: "inactive"
Registry resource records (D1, Worker)  Worker resource marked deleted; D1 resource retains its cfId

There is no automatic expiry or cleanup of preserved D1 databases in Phase 1. They persist until manually deleted via the CF Dashboard or a future deleteData: true implementation.

Stack Deactivation (DEACTIVATE_STACK)

The executeDeactivateStack executor (executors/deactivate-stack.ts) respects the deleteData flag for shared resources:

DEACTIVATE_STACK job
  ├─ Step 1: Resolve stack instance (Registry GET)
  ├─ Step 2: Mark stack → status: "deactivating" (Registry PATCH)
  ├─ Step 3: Enqueue DEACTIVATE_FEATURE sub-job for each feature activation
  ├─ Step 4: Delete shared CF resources — ONLY if deleteData: true
  │           ├─ cf.workers.delete(workerName)   if sharedResources.workerName
  │           ├─ cf.d1.delete(d1Id)              if sharedResources.d1Id
  │           ├─ cf.r2.delete(r2Name)            if sharedResources.r2Name
  │           └─ cf.kv.delete(kvId)              if sharedResources.kvId
  ├─ Step 5: Mark stack → status: "deactivated" (Registry PATCH)
  └─ Step 6: Trigger shell rebuild (CLI Service)

Post-deactivation state by deleteData flag:

Resource                    deleteData: false (default)               deleteData: true
Per-feature Workers         Deleted (by DEACTIVATE_FEATURE sub-jobs)  Deleted
Shared stack D1             Preserved in Cloudflare                   Deleted via cf.d1.delete()
Shared stack R2 bucket      Preserved in Cloudflare                   Deleted via cf.r2.delete()
Shared stack KV namespace   Preserved in Cloudflare                   Deleted via cf.kv.delete()
Stack orchestration Worker  Preserved in Cloudflare                   Deleted via cf.workers.delete()
Registry stacks record      Updated to status: "deactivated"          Updated to status: "deactivated"

R2 cost note: Preserved R2 buckets accrue Cloudflare storage costs against the NNO account even after stack deactivation. Operators should monitor for deactivated stacks with large R2 buckets.

Re-activation with Preserved Data

Because D1 databases are preserved after deactivation, re-activating a feature or stack can reconnect to existing data:

  • Feature re-activation (ACTIVATE_FEATURE): The executor checks whether a D1 with the expected name already exists (cf.d1.findByName(dbName)) before creating a new one. If found, it reuses the existing database and its data.
  • Stack re-activation (PROVISION_STACK): Same idempotency check applies to shared D1, R2, and KV resources — existing resources are reused, not recreated.
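The find-or-create check behind both bullets can be sketched like this. `cf.d1.findByName(dbName)` is named above; the interfaces and the `ensureD1` helper are hypothetical.

```typescript
// Hypothetical shapes for the re-activation idempotency check.
interface D1Info { id: string; name: string }

interface D1Client {
  findByName(name: string): Promise<D1Info | null>;
  create(name: string): Promise<D1Info>;
}

// Reuse a preserved database (and its data) if one with the expected name
// already exists; otherwise create a fresh one. Running this twice for the
// same name creates at most one database.
async function ensureD1(d1: D1Client, dbName: string): Promise<{ db: D1Info; reused: boolean }> {
  const existing = await d1.findByName(dbName);
  if (existing) return { db: existing, reused: true };
  return { db: await d1.create(dbName), reused: false };
}
```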

Phase 2 Planned Additions

Planned Phase 2 additions are tracked separately; see the Provisioning Phase 2 Plan.


11. Wrangler Configuration

# services/provisioning/wrangler.toml

name = "nno-k3m9p2xw7q-provisioning"
main = "src/index.ts"
compatibility_date = "2024-09-13"
compatibility_flags = ["nodejs_compat"]

[triggers]
crons = ["*/15 * * * *"]   # ssl-poller: checks pending CF4SaaS SSL issuance

# Production (default — no --env flag)
[[d1_databases]]
binding = "DB"
database_name = "nno-k3m9p2xw7q-provisioning-db"
database_id = "db158316-1fa6-4fbf-a758-bfff40fb0e46"
migrations_dir = "migrations"

[[queues.producers]]
binding = "PROVISION_QUEUE"
queue = "nno-k3m9p2xw7q-provision-queue"

[[queues.consumers]]
queue = "nno-k3m9p2xw7q-provision-queue"
max_batch_size = 1
max_retries = 3
dead_letter_queue = "nno-k3m9p2xw7q-provision-dlq"

[[queues.producers]]
binding = "PROVISION_DLQ"
queue = "nno-k3m9p2xw7q-provision-dlq"

[[analytics_engine_datasets]]
binding = "NNO_METRICS"
dataset = "nno_metrics"

[[kv_namespaces]]
binding = "NNO_PLATFORM_STATUS_KV"
id = "63deeb49457946c6a68a49b23ea4fc5c"

[env.stg]
name = "nno-k3m9p2xw7q-provisioning-stg"
[[env.stg.d1_databases]]
binding = "DB"
database_name = "nno-k3m9p2xw7q-provisioning-db-stg"
database_id = "74b91da9-0e38-4373-bd74-e95ed67dd210"
migrations_dir = "migrations"
[[env.stg.queues.producers]]
binding = "PROVISION_QUEUE"
queue = "nno-k3m9p2xw7q-provision-queue-stg"
[[env.stg.queues.consumers]]
queue = "nno-k3m9p2xw7q-provision-queue-stg"
max_batch_size = 1
max_retries = 3
dead_letter_queue = "nno-k3m9p2xw7q-provision-dlq-stg"
[[env.stg.queues.producers]]
binding = "PROVISION_DLQ"
queue = "nno-k3m9p2xw7q-provision-dlq-stg"
[[env.stg.analytics_engine_datasets]]
binding = "NNO_METRICS"
dataset = "nno_metrics"
[[env.stg.kv_namespaces]]
binding = "NNO_PLATFORM_STATUS_KV"
id = "db49667ca1904af7ba128a02806e7c7e"

Bindings Summary

| Binding | Type | Purpose |
| --- | --- | --- |
| DB | D1 | Provisioning job state and step tracking |
| PROVISION_QUEUE | Queue producer | Enqueues provisioning jobs for async execution |
| PROVISION_DLQ | Queue producer | Dead-letter queue for jobs that exhaust retries |
| NNO_METRICS | Analytics Engine | Structured metrics and observability events |
| NNO_PLATFORM_STATUS_KV | KV namespace | Written by provisioning on state transitions; read by gateway for enforcement. Key format: `platform:{platformId}:status`. Values: `"active"`, `"suspended"`, `"deprovisioned"` |
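The NNO_PLATFORM_STATUS_KV contract can be wrapped in small helpers like these. The key format and status values come from the binding table above; the helper names, the `KvLike` interface, and the absent-key policy are assumptions for illustration.

```typescript
// Status values documented for NNO_PLATFORM_STATUS_KV.
type PlatformStatus = "active" | "suspended" | "deprovisioned";

// Documented key format: platform:{platformId}:status
function statusKey(platformId: string): string {
  return `platform:${platformId}:status`;
}

// Minimal KV-like interface so the sketch is self-contained (the real
// binding is a Workers KV namespace with the same get/put shape).
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Written by provisioning on state transitions.
async function writeStatus(kv: KvLike, platformId: string, status: PlatformStatus): Promise<void> {
  await kv.put(statusKey(platformId), status);
}

// Read by the gateway for enforcement. Treating an absent key as inactive
// is an assumption here — the real gateway policy may differ.
async function isActive(kv: KvLike, platformId: string): Promise<boolean> {
  return (await kv.get(statusKey(platformId))) === "active";
}
```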

Secrets (per environment)

| Secret | Description |
| --- | --- |
| CF_API_TOKEN | Cloudflare API token for resource provisioning (see §1.1 for required permissions) |
| CF_ACCOUNT_ID | Cloudflare account ID |
| CF_ZONE_ID | Cloudflare Zone ID — required by the ADD_CUSTOM_DOMAIN executor for CF4SaaS custom hostname registration |
| NNO_REGISTRY_URL | Internal URL of the NNO Registry service |
| NNO_INTERNAL_API_KEY | Shared secret for service-to-service calls to Registry |
| AUTH_API_KEY | Shared secret for inbound service-to-service auth |
| CORS_ORIGINS | Comma-separated allowed origins |
| NNO_AUTH_BUNDLE_URL | URL to pre-built auth Worker JS bundle (CI artifact or R2 public URL) — used by UPGRADE_AUTH_WORKER |

Status: Detailed design — PROVISION_STACK added 2026-02-28 Implementation target: services/provisioning/ Related: NNO Registry · System Architecture · Feature Package SDK · Stacks · Stack Registry
