End-to-End PKI & TLS Automation with HashiCorp Vault, Podman, and systemd User Services

Abstract

This document describes how we built a reproducible, environment-agnostic TLS pipeline for applications and proxies using HashiCorp Vault. It covers creating an offline Root CA, provisioning an Intermediate CA inside Vault, enabling HTTPS for Vault itself, and automating certificate issuance and CA-chain refresh with systemd user-level Vault Agents and a lightweight post-hook that reloads containers via a common label. The paper documents key design choices, issues encountered (permissions, capabilities, HTTPS trust), and the fixes that led to a stable setup.


Background & Context

The goal was to standardize TLS across services with minimal per-app customization. Requirements included:

  • A secure Root CA kept offline.
  • An Intermediate CA managed by Vault for day-to-day issuance.
  • Vault served over HTTPS on localhost, with a proper server certificate and chain.
  • Automatic issuance/rotation of leaf certs for apps and automatic refresh of the issuing CA chain for reverse proxies.
  • A single, uniform reload mechanism for containers (Podman) using a reusable label (tls=true).
  • Scripts and config that work consistently across test and prod environments without special-case paths or names.

This was necessary to reduce manual handling of certificates, tighten security guarantees, and make rollouts/reloads predictable and repeatable.


Implementation & Decisions

1) Offline Root CA (01_make_offline_root_ca.sh)

What: Creates an offline Root CA per environment at ~/vault/offline-root/<env>, producing root-ca.key (kept offline) and root-ca.pem (public).
Why: Keeping the Root offline limits blast radius. Only the public root is distributed or referenced elsewhere.
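
A minimal sketch of what this step could look like with OpenSSL. The script itself is not reproduced in this document, so the function name, EC curve, subject, validity period, and the root-ca.cnf helper file are assumptions for illustration; only the directory layout and the root-ca.key / root-ca.pem file names come from the description above.

```shell
# Hypothetical sketch of an offline Root CA step (not the actual 01_make_offline_root_ca.sh).
# Usage: make_offline_root_ca <env> [root_dir]
make_offline_root_ca() {
    env_name="$1"
    root_dir="${2:-$HOME/vault/offline-root/$env_name}"

    mkdir -p "$root_dir"
    chmod 700 "$root_dir"

    # Refuse to overwrite an existing Root key.
    if [ -e "$root_dir/root-ca.key" ]; then
        echo "root-ca.key already exists in $root_dir" >&2
        return 1
    fi

    # Minimal OpenSSL config (assumed values) so the Root carries CA extensions.
    cat > "$root_dir/root-ca.cnf" <<EOF
[req]
distinguished_name = dn
x509_extensions    = v3_root
prompt             = no

[dn]
CN = Example Offline Root CA $env_name

[v3_root]
basicConstraints     = critical, CA:TRUE
keyUsage             = critical, keyCertSign, cRLSign
subjectKeyIdentifier = hash
EOF

    # Private key: stays offline, readable by the owner only.
    openssl ecparam -name secp384r1 -genkey -noout -out "$root_dir/root-ca.key"
    chmod 600 "$root_dir/root-ca.key"

    # Self-signed public Root certificate (10-year validity is an assumption).
    openssl req -x509 -new -sha384 -days 3650 \
        -config "$root_dir/root-ca.cnf" \
        -key "$root_dir/root-ca.key" \
        -out "$root_dir/root-ca.pem"
    chmod 644 "$root_dir/root-ca.pem"
}
```

Calling make_offline_root_ca test would populate ~/vault/offline-root/test. Refusing to overwrite an existing key is a deliberate choice in this sketch: regenerating the Root would invalidate every certificate chained to it.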

2) Intermediate PKI in Vault (02_intermediate_in_vault_sign_with_root.sh)

What: Enables an Intermediate PKI mount (e.g., pki-test), generates a CSR inside Vault, signs it with the offline Root via OpenSSL (v3 intermediate profile), then installs the signed Intermediate cert back into Vault. It also configures AIA/CRL URLs and copies the public root to /home/vault/tls-<env>/root_ca.pem.
Why: The Intermediate performs all operational issuance. AIA/CRL URLs enable validation. Copying the public root makes later server issuance and trust bootstrapping easier.
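
The flow described here (enable the mount, generate the CSR in Vault, sign offline, install, set URLs) can be sketched roughly as follows. This is not the actual script: the function name, working directory, subject, validity period, and extension values are assumptions, and the vault CLI is assumed to be already authenticated through VAULT_ADDR and VAULT_TOKEN.

```shell
# Hypothetical sketch of the Intermediate flow (not the actual 02_* script).
# Assumes VAULT_ADDR and VAULT_TOKEN are already exported for the vault CLI.
# Usage: setup_intermediate <env> <mount> <root_dir> <tls_dir> <vault_url>
setup_intermediate() {
    env_name="$1"; mount="$2"; root_dir="$3"; tls_dir="$4"; vault_url="$5"
    work_dir="$root_dir/intermediate"
    mkdir -p "$work_dir" "$tls_dir"

    # 1. Enable the PKI mount; Vault generates the key internally and returns a CSR.
    vault secrets enable -path="$mount" pki
    vault write -field=csr "$mount/intermediate/generate/internal" \
        common_name="Example Intermediate CA $env_name" \
        > "$work_dir/intermediate.csr"

    # 2. Sign the CSR with the offline Root using a v3 intermediate profile.
    cat > "$work_dir/intermediate.ext" <<EOF
[v3_intermediate]
basicConstraints       = critical, CA:TRUE, pathlen:0
keyUsage               = critical, keyCertSign, cRLSign
subjectKeyIdentifier   = hash
authorityKeyIdentifier = keyid:always
EOF
    openssl x509 -req -sha384 -days 1825 \
        -in "$work_dir/intermediate.csr" \
        -CA "$root_dir/root-ca.pem" -CAkey "$root_dir/root-ca.key" \
        -CAcreateserial \
        -extfile "$work_dir/intermediate.ext" -extensions v3_intermediate \
        -out "$work_dir/intermediate.pem"

    # 3. Install the signed certificate and set the AIA/CRL URLs.
    vault write "$mount/intermediate/set-signed" \
        certificate=@"$work_dir/intermediate.pem"
    vault write "$mount/config/urls" \
        issuing_certificates="$vault_url/v1/$mount/ca" \
        crl_distribution_points="$vault_url/v1/$mount/crl"

    # 4. Publish the public Root for later trust bootstrapping.
    cp "$root_dir/root-ca.pem" "$tls_dir/root_ca.pem"
    chmod 644 "$tls_dir/root_ca.pem"
}
```

Because the key is generated with the "internal" type, the Intermediate private key never leaves Vault; only the CSR and the signed certificate cross the boundary to the offline Root.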

3) Vault Server Certificate (03_issue_vault_server_cert.sh)

What: Creates/uses a PKI role (EC keys, server usage) and issues the Vault server certificate with DNS/IP SANs for 127.0.0.1. Writes:

  • /home/vault/tls-<env>/server.key (0600)
  • /home/vault/tls-<env>/server.crt (0644)
  • /home/vault/tls-<env>/fullchain.crt (leaf + intermediate)
  • /home/vault/tls-<env>/ca_chain.pem (intermediate + root)

Why: Proper file modes and split artifacts match typical TLS consumers and simplify Docker/Podman mounts.
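
To make the artifact layout concrete, here is a small sketch that assembles the four files from PEM files that have already been issued. The helper name and its arguments are hypothetical, and the 0644 mode for fullchain.crt and ca_chain.pem is an assumption (the list above only states modes for server.key and server.crt).

```shell
# Hypothetical helper: lay out the four TLS artifacts from already-issued PEM files.
# Usage: write_tls_artifacts <tls_dir> <leaf.pem> <leaf.key> <intermediate.pem> <root.pem>
write_tls_artifacts() {
    tls_dir="$1"; leaf="$2"; key="$3"; intermediate="$4"; root="$5"
    mkdir -p "$tls_dir"

    # Run in a subshell so the restrictive umask does not leak to the caller.
    (
        umask 077
        cat "$key" > "$tls_dir/server.key"                       # private key
        cat "$leaf" > "$tls_dir/server.crt"                      # leaf only
        cat "$leaf" "$intermediate" > "$tls_dir/fullchain.crt"   # leaf + intermediate
        cat "$intermediate" "$root" > "$tls_dir/ca_chain.pem"    # intermediate + root
        chmod 600 "$tls_dir/server.key"
        chmod 644 "$tls_dir/server.crt" "$tls_dir/fullchain.crt" "$tls_dir/ca_chain.pem"
    )
}
```

Creating every file under umask 077 and relaxing the public ones afterwards means the private key is never world-readable, even briefly.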

4) Vault over HTTPS (Podman Compose)

What: Vault container serves HTTPS on 127.0.0.1:22300 with volume mounts for config, file, and tls-<env>.
Why & Lessons:

  • Initial failures stemmed from permission errors (server.key access) and attempts by the container to chown read-only mounts. We eliminated in-container chown expectations and ensured host-side ownership/modes were correct.
  • Running with userns keep-id triggered capability errors in some cases; we favored a configuration that does not rely on container-side chown or file capabilities.

5) Post-Hook for Agents (scripts/vault-agent-post.sh)

What: A minimal post-hook reads a JSON payload:

  • If it contains a leaf (private_key), it writes <app>.key and <app>.fullchain.pem into the app’s TLS dir.
  • If it contains only a chain, it writes the CA chain to the proxy path.

Then it reloads Podman containers labeled tls=true by exec’ing inside them (e.g., nginx -s reload).
Why: A uniform reload path avoids per-app logic. The label decouples the agent from specific container names.
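
The hook logic can be sketched as below. This is not the actual scripts/vault-agent-post.sh: the function names and argument order are hypothetical, and the payload shape is an assumption (a flat JSON object with private_key, certificate, and issuing_ca for a leaf, or certificate alone for a CA chain). It relies on jq for JSON parsing and on the podman CLI.

```shell
# Hypothetical sketch of a post-hook (not the actual scripts/vault-agent-post.sh).
# Assumed payload: a flat JSON object with private_key/certificate/issuing_ca
# for a leaf, or certificate only for a CA chain. Requires jq and podman.
# Usage: vault_agent_post <payload.json> <app_name> <tls_dir> <proxy_chain_path>
vault_agent_post() {
    payload="$1"; app="$2"; tls_dir="$3"; chain_path="$4"

    if jq -e 'has("private_key")' "$payload" >/dev/null; then
        # Leaf payload: write the key, then leaf + issuing CA as the full chain.
        mkdir -p "$tls_dir"
        ( umask 077; jq -r '.private_key' "$payload" > "$tls_dir/$app.key" )
        chmod 600 "$tls_dir/$app.key"
        jq -r '.certificate, .issuing_ca' "$payload" > "$tls_dir/$app.fullchain.pem"
    else
        # Chain-only payload: refresh the CA chain used by the proxy.
        mkdir -p "$(dirname "$chain_path")"
        jq -r '.certificate' "$payload" > "$chain_path"
    fi

    reload_labeled_containers
}

# Reload every running container that carries the tls=true label.
reload_labeled_containers() {
    podman ps --filter label=tls=true --format '{{.Names}}' |
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        # nginx is the example from the text; other images need their own reload command.
        podman exec "$name" nginx -s reload </dev/null ||
            echo "reload failed: $name" >&2
    done
}
```

If no container carries the label, podman ps prints nothing and the loop simply does not run, which matches the "skipped if no labeled containers" behaviour noted in the timeline.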

6) CA Distribution to Users (distribute_ca_to_agents.sh)

What: Copies a trusted CA (Root or chain) to /home/<user>/vault/ca/ca.pem with correct ownership/mode, idempotently.
Why: Agents talking to Vault over HTTPS must trust Vault’s chain. This removes ad-hoc manual copying and fixes “unknown authority” errors.
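
One way to make the copy idempotent is to compare source and destination before writing, as in this sketch. The function name and its optional home_dir argument are hypothetical, and group ownership is left out for brevity.

```shell
# Hypothetical sketch of idempotent CA distribution (not the actual distribute_ca_to_agents.sh).
# Usage: distribute_ca <ca_source.pem> <user> [home_dir]
distribute_ca() {
    src="$1"; user="$2"
    home_dir="${3:-/home/$user}"
    dest_dir="$home_dir/vault/ca"
    dest="$dest_dir/ca.pem"

    # Skip the copy when the destination already matches the source.
    if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
        echo "unchanged: $dest"
    else
        mkdir -p "$dest_dir"
        cp "$src" "$dest"
        echo "updated: $dest"
    fi

    # Always enforce the mode; ownership can only be changed when running as root.
    chmod 644 "$dest"
    if [ "$(id -u)" -eq 0 ]; then
        chown -R "$user" "$home_dir/vault"
    fi
}
```

Running it twice with the same input reports "updated" and then "unchanged", so it is safe to call from a rollout loop over all agent users.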

7) App Leaf Agent (setup-vault-agent-app-config2.sh)

What: Per app:

  • Creates minimal policies and an AppRole with a periodic token.
  • Writes vault-agent.hcl and a template that issues a leaf from the Intermediate.
  • Sets address = "https://127.0.0.1:22300" and ca_cert = "/home/<user>/vault/ca/ca.pem" in the agent's vault stanza.
  • Uses a file sink (mode = 0600) for the token.
  • Installs a minimal systemd user service (no hardening flags) to avoid 218/CAPABILITIES issues.
  • Post-hook writes files to ~/tls/ and triggers the label-based reload.

Why: Minimal units proved robust on hosts where systemd hardening and file capabilities caused agent startup failures. A file sink avoids /dev/null temp-file errors and ensures integer mode is accepted.
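
The generated files might look roughly like the output of the sketch below. Everything except the Vault address, the ca_cert location, and the 0600 sink mode is an assumption made for illustration: the function name, the agent directory, the role_id/secret_id and token paths, the template file names, the argument passed to the post-hook, and the /usr/bin/vault binary path.

```shell
# Hypothetical sketch of the per-app agent files (not the actual setup script).
# Paths other than the Vault address and the ca_cert location are assumptions.
# Usage: write_agent_files <user_home> <mount> <role> <common_name>
write_agent_files() {
    home_dir="$1"; mount="$2"; role="$3"; cn="$4"
    agent_dir="$home_dir/vault/agent"
    unit_dir="$home_dir/.config/systemd/user"
    mkdir -p "$agent_dir" "$unit_dir"

    # Template: request a leaf from the Intermediate and render the response as JSON.
    cat > "$agent_dir/leaf.ctmpl" <<EOF
{{ with secret "$mount/issue/$role" "common_name=$cn" }}{{ .Data | toJSON }}{{ end }}
EOF

    cat > "$agent_dir/vault-agent.hcl" <<EOF
vault {
  address = "https://127.0.0.1:22300"
  ca_cert = "$home_dir/vault/ca/ca.pem"
}

auto_auth {
  method "approle" {
    config = {
      role_id_file_path                   = "$agent_dir/role_id"
      secret_id_file_path                 = "$agent_dir/secret_id"
      remove_secret_id_file_after_reading = false
    }
  }

  sink "file" {
    config = {
      path = "$agent_dir/token"
      mode = 0600
    }
  }
}

template {
  source      = "$agent_dir/leaf.ctmpl"
  destination = "$agent_dir/leaf.json"
  perms       = "0600"
  exec {
    command = ["$home_dir/scripts/vault-agent-post.sh", "$agent_dir/leaf.json"]
  }
}
EOF

    # Minimal user unit: no hardening directives.
    cat > "$unit_dir/vault-agent.service" <<'EOF'
[Unit]
Description=Vault Agent (user service)

[Service]
ExecStart=/usr/bin/vault agent -config=%h/vault/agent/vault-agent.hcl
Restart=on-failure

[Install]
WantedBy=default.target
EOF
}
```

After the files are written, the service would typically be activated as the app user with systemctl --user daemon-reload followed by systemctl --user enable --now vault-agent.service.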

8) Proxy CA Agent (setup-vault-agent-proxy-config2.sh)

What: Similar structure to the app agent, but the template reads the issuing CA (pki-test/cert/ca) and the post-hook writes the chain to /home/<proxyuser>/nginx/ca/current-ca-chain.pem.
Why: Keeps chain refresh separate from leaf issuance (principle of least privilege) and uses the same minimal systemd unit pattern to avoid capability drop failures.

9) Switch AIA/CRL URLs to HTTPS

What: After Vault serves HTTPS successfully, the PKI mount URLs are updated to https://127.0.0.1:22300/v1/<mount>/{ca,crl}.
Why: Ensures downstream validation paths reference TLS-protected endpoints.
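
As a sketch, the switch amounts to rewriting the mount's config/urls endpoint. The helper name is hypothetical, and the vault CLI is assumed to be authenticated through VAULT_ADDR and VAULT_TOKEN.

```shell
# Hypothetical helper; assumes VAULT_ADDR/VAULT_TOKEN are set for the vault CLI.
# Usage: switch_pki_urls_to_https <mount> [base_url]
switch_pki_urls_to_https() {
    mount="$1"
    base_url="${2:-https://127.0.0.1:22300}"

    vault write "$mount/config/urls" \
        issuing_certificates="$base_url/v1/$mount/ca" \
        crl_distribution_points="$base_url/v1/$mount/crl"

    # Read the setting back so the change is visible in the script output.
    vault read "$mount/config/urls"
}
```

For example, switch_pki_urls_to_https pki-test. Certificates issued before the change keep the URLs that were embedded at issuance time; only certificates issued afterwards carry the HTTPS URLs.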


Timeline / Evolution

  1. Root CA Creation: Succeeded; an early verification path typo was fixed to point to the new offline-root layout.
  2. Intermediate Setup: Enabled pki-test; the CSR was generated in Vault and signed by the offline Root. AIA warnings were noted, and the URLs were later switched to HTTPS.
  3. Vault Server Cert Issuance: Files generated under /home/vault/tls-test. Initial container start failed due to key permissions and chown attempts; volumes and modes were corrected.
  4. HTTPS & Unseal: A health check confirmed HTTPS. Unseal initially failed because the client did not trust Vault’s certificate chain; this was resolved by distributing Vault’s CA to operators/agents and/or configuring ca_cert.
  5. Agent Rollout (App): First runs hit 218/CAPABILITIES due to hardening. Switching to a minimal unit and a file token sink fixed startup and /dev/null temp-file errors.
  6. Agent Rollout (Proxy): Same capability issue; resolved with minimal unit. A manual run confirmed .ca.json rendering and CA chain write. Label-based reload ran (skipped if no labeled containers).
  7. Stabilization: Uniform reload label (tls=true), idempotent CA distribution, and explicit HTTPS trust in agent configs removed intermittent errors.

Current State

  • Vault over HTTPS: Running on https://127.0.0.1:22300 using server.key/crt and fullchain.crt.
  • PKI Mounts: pki-test active. AIA/CRL URLs configured to HTTPS.
  • App Leaf Agent (e.g., user nctest):
    • Service active (minimal unit), authenticates via AppRole, writes ~/tls/<app>.key and <app>.fullchain.pem, and attempts labeled container reloads.
  • Proxy CA Agent (e.g., user proxytest):
    • Service active (minimal unit), writes /home/proxytest/nginx/ca/current-ca-chain.pem.
  • Post-hook: Unified and environment-agnostic; uses container label tls=true.
  • CA Distribution: Automated to /home/<user>/vault/ca/ca.pem.

Pending/Partial:

  • Ensure all relevant containers are labeled tls=true and reachable by the agent’s user for reloads.
  • Confirm SELinux/AppArmor contexts if enabled (not covered).
  • Review userns keep-id versus rootful container policy for production.

Future Improvements

  1. Hardened Units (Optional): Reintroduce systemd hardening gradually with tested capability settings to avoid 218/CAPABILITIES.
  2. Policy Minimization: Restrict PKI roles per app (narrower domains/SANs) and split read/issue permissions more strictly.
  3. Health & Verification: Add periodic openssl verify and curl --cacert checks for Vault and proxy endpoints; alert on failures.
  4. Renewal Windows: Tune agent templates for proactive rotation and add metrics on certificate lifetimes.
  5. Configuration Management: Migrate scripts to Ansible roles; template all environment values from a single inventory.
  6. Observability: Centralize agent logs and add dashboards for issuance counts, errors, and reload outcomes.
  7. Production Readiness: Document the preferred container user strategy, finalize SELinux labels, and define an incident playbook.

Conclusion

The new pipeline provides a secure, repeatable TLS foundation: an offline Root, a managed Intermediate in Vault, HTTPS-served Vault, and automatic certificate/chain management via systemd user-level agents. By standardizing on a single reload label and minimizing per-app differences, operations are simpler and less error-prone. The approach resolves prior issues with permissions, capabilities, and HTTPS trust, and sets a clear path for further hardening and automation at scale.