Introduction

There is a pattern I have watched repeat itself across engineering organizations of every size. A company adopts DevOps. Delivery improves. The team grows. Then, quietly, a new kind of friction sets in — not between dev and ops, but inside every individual team. Engineers spend meaningful portions of their week managing CI/CD pipelines, writing Terraform, configuring dashboards, and debugging secret injection failures. The work is not their product. It is infrastructure tax.

Platform Engineering is the architectural response to that problem. This post is a deep dive into what it actually takes to build an Internal Developer Platform (IDP) that delivers on the promise — not just the theory, but the team structure, the technical components, the implementation sequence, and the mistakes worth avoiding.

Table of Contents


The Problem Platform Engineering Actually Solves

Before defining what an IDP is, it is worth being precise about the pain it targets.

In a typical organization that has scaled past a handful of teams, each product team is expected to own its full delivery stack. That means writing and maintaining GitHub Actions workflows, managing cloud resource definitions in Terraform or Pulumi, owning their monitoring dashboards, handling secrets rotation, and keeping up with base image updates for security patching.

This model has a name: you build it, you run it. It is sound in principle. The issue is that at scale, it produces a situation where 30 or 40 teams are each solving the same set of infrastructure problems — independently, inconsistently, and at a compounding cost to their actual product work.

The result is predictable:

  • Onboarding a new engineer requires understanding not just the product domain but a bespoke local infrastructure setup that exists nowhere else in the company.
  • Security patching means a wave of PRs across dozens of repositories.
  • Incident response is complicated by inconsistent observability configurations — one team ships logs to CloudWatch, another to Loki, a third to both.
  • Platform capabilities (feature flags, rate limiting, service mesh config) get reinvented team by team.

Platform Engineering recentralizes the solution without recentralizing the control. A dedicated platform team builds and maintains a curated set of reusable capabilities. Product teams consume them through self-service interfaces. The platform team does not become a bottleneck — it becomes infrastructure that scales.


Platform Engineering Is Not DevOps Rebranded

This distinction matters. DevOps is a cultural and organizational shift: it breaks down the wall between development and operations and places shared ownership of delivery across both. Platform Engineering does not replace that. It builds on top of it.

The key architectural difference is intent:

  DevOps Platform Engineering
Primary concern Delivery collaboration Developer cognitive load
Mechanism Shared ownership Curated self-service
Output Culture and practices A platform as a product
Team model Cross-functional squads Platform team + product teams
Scaling model Distributed Centralized capabilities, distributed consumption

A useful heuristic: if your DevOps implementation means every team owns their own Terraform modules, pipeline configs, and monitoring setup — you have distributed the work without reducing it. Platform Engineering asks whether that work can be abstracted, centralized, and offered through an interface that product teams do not need to understand deeply.


The Four Pillars of a Production-Grade IDP

An IDP is not a single tool. It is a composed layer of capabilities. These are the four most commonly prioritized in practice:

1. Self-Service Infrastructure Provisioning

The goal here is eliminating infrastructure tickets. Instead of a product team raising a request for a new PostgreSQL database, message queue, or Kubernetes namespace, they provision it themselves through a service catalog or declarative template — and the platform team’s approved, policy-compliant configuration handles the rest.

The most mature implementations use Crossplane for this. Crossplane runs inside your Kubernetes cluster and lets platform teams define Composite Resource Definitions (XRDs) — reusable abstractions over cloud resources. Product teams provision a Database or MessageQueue the same way they’d create any Kubernetes resource:

apiVersion: platform.company.io/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: orders-team
spec:
  parameters:
    storageGB: 50
    tier: standard
    region: eu-west-1
  writeConnectionSecretToRef:
    name: orders-db-credentials

Behind the scenes, Crossplane translates this into the appropriate AWS RDS or Azure Database resource, applies the platform team’s approved configuration (encryption, backup policy, VPC placement), and writes the connection secret directly into the namespace. The product team never touches the cloud console or writes infrastructure code.

Kratix offers an alternative model where platform teams define “promises” — contracts describing what a capability provides and how it is fulfilled — with multi-cluster support built in from the start.

Both approaches integrate naturally with Argo CD or Flux for GitOps-based reconciliation, meaning all provisioned infrastructure is declarative, auditable, and version-controlled.


2. Golden Path CI/CD Pipelines

The term “golden path” refers to the paved, recommended route through the delivery lifecycle. The platform team maintains it. Product teams walk it by default, with the option to diverge when genuinely needed.

In GitHub Actions, this means reusable workflow templates stored in a central repository and versioned:

# .github/workflows/deploy.yml (in a product team repo)
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    uses: company/platform-workflows/.github/workflows/service-deploy.yml@v2.4.0
    with:
      service-name: orders-api
      environment: production
      image-registry: ghcr.io/company
    secrets: inherit

The referenced workflow at v2.4.0 handles the full pipeline: building and tagging the container image, running SAST and container scanning (Trivy, Snyk), signing the image with Cosign, publishing to the registry, and triggering a GitOps deployment. When the platform team updates the base image or upgrades the scanner, it ships as a new version of the template. Product teams opt in by bumping their version reference — one PR each, or automated via Dependabot.

The strategic value here is compounding. Instead of security improvements requiring a manual PR to every service repository, the platform ships them once. Adoption is the product team’s choice, but the work is the platform team’s.


3. Centralized Observability

Inconsistent observability is one of the most expensive operational problems at scale. When each team ships metrics and logs to different systems with different naming conventions, cross-service incident analysis becomes manual and slow.

A platform-owned observability stack solves this by providing a pre-configured, consistent baseline per workload type. The standard modern stack for Kubernetes:

Metrics:   Prometheus (collection) → Grafana (dashboards + alerting)
Logs:      Promtail/Alloy (collection) → Loki (storage) → Grafana
Traces:    OpenTelemetry SDK → Tempo → Grafana

The platform team defines and maintains:

  • Grafana dashboard templates provisioned automatically per namespace, covering the RED metrics (Rate, Errors, Duration) for HTTP services and standard infrastructure metrics for databases and queues.
  • Alerting baselines applied by default — latency SLOs, error rate thresholds, pod restart alerts — with product teams able to extend but not opt out.
  • OpenTelemetry Collector configuration deployed as a DaemonSet, so teams get traces by instrumenting their application once, with no per-team collector configuration.

When a new service is registered in the platform’s service catalog, a Grafana workspace and default alert rules are provisioned automatically. The team opens Grafana on day one and has meaningful dashboards immediately.


4. Security and Policy Enforcement

Security should be a property of the platform, not a responsibility of individual teams. Admission controllers make this enforceable.

OPA/Gatekeeper and Kyverno are the two dominant options for Kubernetes policy enforcement. Both intercept resource creation at the API server level and validate or mutate resources against defined policies before they are admitted.

A representative policy set applied at the platform layer:

# Kyverno policy: require signed images
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-image-signature
      match:
        resources:
          kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "ghcr.io/company/*"
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/company/*"
                    issuer: "https://token.actions.githubusercontent.com"

This policy ensures that any pod deploying a company-owned image must have a verifiable Sigstore/Cosign signature — enforced at admission, not audited after the fact. Unsigned images are blocked. Teams cannot accidentally deploy unscanned builds to production.

Combined with network policies that default to deny-all per namespace, secret store integrations (External Secrets Operator pulling from Vault or AWS Secrets Manager), and RBAC baselines provisioned per team namespace, the security posture becomes consistent and auditable across every team — without requiring any security-specific work from product engineers.


The Team That Runs It

The technical components above require the right organizational model to be effective. The platform team must operate with a product mindset, not a service mindset. That distinction is practical, not philosophical.

A service-minded platform team responds to requests, builds what is asked, and measures itself by ticket closure rate. A product-minded platform team maintains a roadmap, conducts internal user research (talking to product teams about where they lose time), tracks adoption metrics, and makes deliberate decisions about what the platform will and will not support.

Concretely, this means:

  • A developer portal — typically Backstage — as the front door to the platform. Service catalog, self-service scaffolding, documentation, and runbooks live here. Product teams should never need to ask the platform team how to do something basic.
  • Platform SLAs — the platform team commits to availability and support response times for the capabilities it owns. If the CI pipeline is the platform team’s product, a broken golden path workflow is a P1 for the platform team, not the product team’s problem to debug.
  • Adoption metrics over capability metrics — measuring how many teams are using the golden path is more useful than measuring how many features the golden path has. Unused capabilities are waste.

Team size guidance: at 10–20 product teams, a platform team of 3–5 engineers is sufficient to build and maintain a baseline IDP. At 40+ teams, 6–10 engineers is appropriate. The ratio scales sublinearly — the platform’s leverage increases as adoption grows.


Implementation Sequence: How to Start Without Boiling the Ocean

The most common mistake in IDP adoption is attempting to build everything at once. A platform that is six months from being useful is not a platform — it is a project.

The recommended sequence:

Phase 1 — Identify and solve the single highest-friction point. Interview five to ten engineers across different product teams. Ask where they lose the most time on non-product work. The answer is almost always consistent within an organization: it is usually onboarding, or CI/CD fragmentation, or observability gaps. Build one thing well. Ship it. Measure adoption.

Phase 2 — Establish the service catalog and golden path. Backstage + a reusable GitHub Actions workflow library. This is the highest-leverage combination — it gives teams a single place to register services, a documented recommended path, and immediate CI/CD consistency. Most organizations see meaningful adoption within the first month.

Phase 3 — Add self-service infrastructure provisioning. Crossplane XRDs for the most common resource types (databases, queues, storage buckets). Integrate with the service catalog so provisioning is triggered from the developer portal.

Phase 4 — Close the observability and security gaps. Roll out the centralized observability stack and admission controller policies. Apply to new services first; migrate existing services on a team-by-team cadence.


Measuring Success

Platform Engineering investment is justified by the return it delivers to product teams. These are the metrics that matter:

  • Time-to-production for a new service — from repository creation to first production deployment. Target: under four hours with a mature IDP.
  • Self-service ratio — percentage of infrastructure provisioning requests fulfilled without platform team involvement. Target: >90%.
  • Pipeline standardization coverage — percentage of services using the golden path. Track by team.
  • Mean time to onboard a new engineer — time from first day to first production PR merged.
  • Security patching lead time — time from vulnerability disclosure to platform-wide remediation. A centralized pipeline template makes this a single PR.

What Mature Looks Like

A mature IDP is largely invisible to product engineers. They scaffold a new service from the Backstage catalog, get a repository with a configured pipeline, push their first commit, and watch it deploy to a staging environment — with dashboards, alerting, and log aggregation already in place. They provision a database with a YAML file. Secrets appear in their namespace. Network policies are applied. Image signing is enforced.

The platform team is not involved in any of that. They are working on the next capability, talking to teams about what is still painful, and monitoring adoption.

That invisibility is the goal. The best platform is one engineers use without thinking about it.


Closing Thoughts

Platform Engineering is not a technology choice — it is an architectural and organizational bet that centralizing delivery infrastructure into a curated, self-service product creates more value than distributing that responsibility across every team.

The evidence from organizations that have made this investment is consistent: onboarding accelerates, security posture improves, and engineering teams spend more time building the products their organizations actually need.

The tools are mature. Crossplane, Backstage, Argo CD, Kyverno, the OpenTelemetry ecosystem — all are production-ready and widely deployed. The harder work is organizational: forming a team with a genuine product mandate, resisting the pull toward custom one-off solutions, and measuring the platform by adoption rather than by features shipped.

Start with the highest-friction point. Build it well. Ship it. The platform compounds from there.


If you are working through IDP adoption or have questions about any of the patterns covered here, reach out — I am always happy to dig into the specifics.