End-to-End PKI & TLS Automation with HashiCorp Vault, Podman, and systemd User Services
Abstract
This document describes how we built a reproducible, environment-agnostic TLS pipeline for applications and proxies using HashiCorp Vault. It covers creating an offline Root CA, provisioning an Intermediate CA inside Vault, enabling HTTPS for Vault itself, and automating certificate issuance and CA-chain refresh with systemd user-level Vault Agents and a lightweight post-hook that reloads containers via a common label. The paper documents key design choices, issues encountered (permissions, capabilities, HTTPS trust), and the fixes that led to a stable setup.
Background & Context
The goal was to standardize TLS across services with minimal per-app customization. Requirements included:
- A secure Root CA kept offline.
- An Intermediate CA managed by Vault for day-to-day issuance.
- Vault served over HTTPS on localhost, with a proper server certificate and chain.
- Automatic issuance/rotation of leaf certs for apps and automatic refresh of the issuing CA chain for reverse proxies.
- A single, uniform reload mechanism for containers (Podman) using a reusable label (tls=true).
- Scripts and config that work consistently across test and prod environments without special-case paths or names.
This was necessary to reduce manual handling of certificates, tighten security guarantees, and make rollouts/reloads predictable and repeatable.
Implementation & Decisions
1) Offline Root CA (01_make_offline_root_ca.sh)
What: Creates an offline Root CA per environment at ~/vault/offline-root/<env>, producing root-ca.key (kept offline) and root-ca.pem (public).
Why: Keeping the Root offline limits blast radius. Only the public root is distributed or referenced elsewhere.
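A minimal sketch of what such a script might do is shown below. The key type (EC P-256), the ten-year lifetime, the subject name, and the function name are all assumptions for illustration; the original script's actual parameters are not given here.

```shell
#!/bin/sh
# Sketch only: key type, lifetime and subject are assumptions,
# not values taken from 01_make_offline_root_ca.sh.

# make_offline_root_ca <env> [base_dir]
# The body runs in a subshell so umask and "set -eu" do not leak to the caller.
make_offline_root_ca() (
  set -eu
  env_name="$1"
  base_dir="${2:-$HOME/vault/offline-root}"
  out_dir="$base_dir/$env_name"

  umask 077
  mkdir -p "$out_dir"

  # Self-contained OpenSSL config, so the result does not depend on the
  # system-wide openssl.cnf.
  cat > "$out_dir/root-ca.cnf" <<'EOF'
[ req ]
distinguished_name = dn
x509_extensions    = v3_root
prompt             = no

[ dn ]
CN = Example Offline Root CA

[ v3_root ]
basicConstraints     = critical, CA:TRUE
keyUsage             = critical, keyCertSign, cRLSign
subjectKeyIdentifier = hash
EOF

  # Private key stays in out_dir; only root-ca.pem is meant to leave it.
  openssl ecparam -name prime256v1 -genkey -noout -out "$out_dir/root-ca.key"
  openssl req -new -x509 -sha256 -days 3650 \
    -config "$out_dir/root-ca.cnf" \
    -key "$out_dir/root-ca.key" \
    -out "$out_dir/root-ca.pem"

  chmod 0600 "$out_dir/root-ca.key"
  chmod 0644 "$out_dir/root-ca.pem"
)
```

Calling make_offline_root_ca test would populate ~/vault/offline-root/test; the machine holding root-ca.key can then be taken offline.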
2) Intermediate PKI in Vault (02_intermediate_in_vault_sign_with_root.sh)
What: Enables an Intermediate PKI mount (e.g., pki-test), generates a CSR inside Vault, signs it with the offline Root via OpenSSL (v3 intermediate profile), then installs the signed Intermediate cert back into Vault. It also configures AIA/CRL URLs and copies the public root to /home/vault/tls-<env>/root_ca.pem.
Why: The Intermediate performs all operational issuance. AIA/CRL URLs enable validation. Copying the public root makes later server issuance and trust bootstrapping easier.
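The signing step can be sketched roughly as follows. The extension values (including pathlen:0), the five-year lifetime, the common name, and the function and file names are assumptions for illustration; only the overall flow (CSR from Vault, signed offline, installed back) comes from the description above.

```shell
#!/bin/sh
# Sketch only: extension values, lifetime and file names are assumptions.

# sign_intermediate_csr <root_dir> <csr_file> <out_cert>
# Signs a CSR produced by Vault with the offline Root key in <root_dir>.
sign_intermediate_csr() (
  set -eu
  root_dir="$1"; csr_file="$2"; out_cert="$3"
  ext_file="$(mktemp)"
  trap 'rm -f "$ext_file"' EXIT

  # "v3 intermediate" profile: a CA certificate that may only issue leaves.
  cat > "$ext_file" <<'EOF'
[ v3_intermediate ]
basicConstraints       = critical, CA:TRUE, pathlen:0
keyUsage               = critical, keyCertSign, cRLSign
subjectKeyIdentifier   = hash
authorityKeyIdentifier = keyid,issuer
EOF

  openssl x509 -req -sha256 -days 1825 \
    -in "$csr_file" \
    -CA "$root_dir/root-ca.pem" -CAkey "$root_dir/root-ca.key" \
    -CAcreateserial \
    -extfile "$ext_file" -extensions v3_intermediate \
    -out "$out_cert"
)

# Surrounding Vault calls, run with a suitably privileged token:
#   vault secrets enable -path=pki-test pki
#   vault write -field=csr pki-test/intermediate/generate/internal \
#       common_name="Example Intermediate CA" > intermediate.csr
#   sign_intermediate_csr ~/vault/offline-root/test intermediate.csr intermediate.pem
#   vault write pki-test/intermediate/set-signed certificate=@intermediate.pem
```

Because the CSR is generated with the internal type, the Intermediate's private key never leaves Vault; only the CSR and the signed certificate cross the boundary to the offline Root.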
3) Vault Server Certificate (03_issue_vault_server_cert.sh)
What: Creates/uses a PKI role (EC keys, server usage) and issues the Vault server certificate with DNS/IP SANs for 127.0.0.1. Writes:
- /home/vault/tls-<env>/server.key (0600)
- /home/vault/tls-<env>/server.crt (0644)
- /home/vault/tls-<env>/fullchain.crt (leaf + intermediate)
- /home/vault/tls-<env>/ca_chain.pem (intermediate + root)
Why: Proper file modes and split artifacts match typical TLS consumers and simplify Docker/Podman mounts.
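As an illustration of how the four artifacts relate, the sketch below assembles them from already-issued PEM files. The function name and argument order are hypothetical, and the 0644 mode on the two chain files is an assumption (only the modes of server.key and server.crt are stated above).

```shell
#!/bin/sh
# Sketch only: function name, argument order and chain-file modes are assumptions.

# write_tls_artifacts <tls_dir> <key> <leaf> <intermediate> <root>
write_tls_artifacts() (
  set -eu
  tls_dir="$1"; key="$2"; leaf="$3"; intermediate="$4"; root="$5"

  mkdir -p "$tls_dir"
  install -m 0600 "$key"  "$tls_dir/server.key"
  install -m 0644 "$leaf" "$tls_dir/server.crt"

  # Order matters: servers present leaf first, then the issuing CA.
  cat "$leaf" "$intermediate" > "$tls_dir/fullchain.crt"
  # Trust bundle for clients: intermediate first, then root.
  cat "$intermediate" "$root" > "$tls_dir/ca_chain.pem"
  chmod 0644 "$tls_dir/fullchain.crt" "$tls_dir/ca_chain.pem"
)
```

The input PEM files are expected to end with a newline; otherwise the concatenated chain would fuse two PEM blocks on one line.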
4) Vault over HTTPS (Podman Compose)
What: Vault container serves HTTPS on 127.0.0.1:22300 with volume mounts for config, file, and tls-<env>.
Why & Lessons:
- Initial failures stemmed from permission errors (server.key access) and attempts by the container to chown read-only mounts. We eliminated in-container chown expectations and ensured host-side ownership/modes were correct.
- Running with userns keep-id triggered capability errors in some cases; we favored a configuration that does not rely on container-side chown or file capabilities.
5) Post-Hook for Agents (scripts/vault-agent-post.sh)
What: A minimal post-hook reads a JSON payload:
- If it contains a leaf (private_key), it writes <app>.key and <app>.fullchain.pem into the app’s TLS dir.
- If it contains only a chain, it writes the CA chain to the proxy path.
Then it reloads Podman containers labeled tls=true by exec’ing inside them (e.g., nginx -s reload).
Why: A uniform reload path avoids per-app logic. The label decouples the agent from specific container names.
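A rough sketch of the two halves of such a hook follows. It assumes the payload is a flat JSON object with private_key, certificate, and issuing_ca fields and that jq is available; the real payload shape and the function names are not given in the text, so treat both as hypothetical.

```shell
#!/bin/sh
# Sketch only: payload field names and function names are assumptions.

# write_from_payload <payload.json> <app> <tls_dir> <proxy_chain_path>
write_from_payload() (
  set -eu
  payload="$1"; app="$2"; tls_dir="$3"; chain_path="$4"

  if jq -e 'has("private_key")' "$payload" >/dev/null; then
    # Leaf payload: key plus leaf-and-issuer chain, readable by owner only.
    umask 077
    mkdir -p "$tls_dir"
    jq -r '.private_key' "$payload" > "$tls_dir/$app.key"
    jq -r '.certificate, .issuing_ca' "$payload" > "$tls_dir/$app.fullchain.pem"
  else
    # Chain-only payload: public CA material for the proxy.
    mkdir -p "$(dirname "$chain_path")"
    jq -r '.issuing_ca' "$payload" > "$chain_path"
  fi
)

# reload_labeled_containers [label] [reload command...]
# Defaults: label tls=true, command "nginx -s reload".
reload_labeled_containers() {
  label="${1:-tls=true}"
  [ "$#" -gt 0 ] && shift
  [ "$#" -gt 0 ] || set -- nginx -s reload

  # No labeled containers means the loop body never runs (reload skipped).
  for id in $(podman ps -q --filter "label=$label"); do
    podman exec "$id" "$@" || echo "reload failed for $id" >&2
  done
}
```

A container opts in at creation time with podman run --label tls=true, so adding a new service to the reload set needs no change to the hook.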
6) CA Distribution to Users (distribute_ca_to_agents.sh)
What: Copies a trusted CA (Root or chain) to /home/<user>/vault/ca/ca.pem with correct ownership/mode, idempotently.
Why: Agents talking to Vault over HTTPS must trust Vault’s chain. This removes ad-hoc manual copying and fixes “unknown authority” errors.
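The idempotent copy could look something like the sketch below. The function name, the 0644 mode, and the optional owner argument are assumptions; the original script's exact behavior is not shown.

```shell
#!/bin/sh
# Sketch only: function name and file mode are assumptions.

# distribute_ca <source_pem> <dest_pem> [owner:group]
distribute_ca() {
  src="$1"; dest="$2"; owner="${3:-}"

  # Idempotent: leave the destination untouched when content already matches.
  if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
    echo "unchanged: $dest"
  else
    mkdir -p "$(dirname "$dest")"
    install -m 0644 "$src" "$dest"
    echo "updated: $dest"
  fi

  # Changing ownership needs root; skipped when no owner is given.
  if [ -n "$owner" ]; then
    chown "$owner" "$dest"
  fi
}
```

Run as root, a call such as distribute_ca /home/vault/tls-test/ca_chain.pem /home/nctest/vault/ca/ca.pem nctest:nctest can be repeated safely; the second run reports the file as unchanged.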
7) App Leaf Agent (setup-vault-agent-app-config2.sh)
What: Per app:
- Creates minimal policies and an AppRole with a periodic token.
- Writes vault-agent.hcl and a template that issues a leaf from the Intermediate.
- Sets vault { address = "https://127.0.0.1:22300"; ca_cert = "/home/<user>/vault/ca/ca.pem" }.
- Uses a file sink (mode = 0600) for the token.
- Installs a minimal systemd user service (no hardening flags) to avoid 218/CAPABILITIES issues.
- Post-hook writes files to ~/tls/ and triggers the label-based reload.
Why: Minimal units proved robust on hosts where systemd hardening and file capabilities caused agent startup failures. A file sink avoids /dev/null temp-file errors and ensures integer mode is accepted.
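To make the shape of the generated files concrete, here is a sketch of two render functions. The file layout under vault/agent/, the role name, the common_name suffix, the inline template contents, and the /usr/bin/vault path are all assumptions; only the vault address, ca_cert path, file sink with mode 0600, and the hardening-free unit come from the description above.

```shell
#!/bin/sh
# Sketch only: paths under vault/agent/, role and common_name are hypothetical.

# render_agent_config <user_home> <app> <pki_mount> <role>
render_agent_config() {
  home_dir="$1"; app="$2"; mount="$3"; role="$4"
  cat <<EOF
vault {
  address = "https://127.0.0.1:22300"
  ca_cert = "$home_dir/vault/ca/ca.pem"
}

auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "$home_dir/vault/agent/role_id"
      secret_id_file_path = "$home_dir/vault/agent/secret_id"
      remove_secret_id_file_after_reading = false
    }
  }
  sink "file" {
    config = {
      path = "$home_dir/vault/agent/token"
      mode = 0600
    }
  }
}

template {
  contents    = "{{ with secret \"$mount/issue/$role\" \"common_name=$app.example.internal\" }}{{ .Data | toJSON }}{{ end }}"
  destination = "$home_dir/vault/agent/$app.json"
  command     = "$home_dir/scripts/vault-agent-post.sh"
}
EOF
}

# render_agent_unit: minimal user unit, deliberately without hardening options.
render_agent_unit() {
  cat <<'EOF'
[Unit]
Description=Vault Agent (user service)

[Service]
ExecStart=/usr/bin/vault agent -config=%h/vault/agent/vault-agent.hcl
Restart=on-failure

[Install]
WantedBy=default.target
EOF
}
```

The unit would typically be written to ~/.config/systemd/user/vault-agent.service and started with systemctl --user daemon-reload followed by systemctl --user enable --now vault-agent.service.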
8) Proxy CA Agent (setup-vault-agent-proxy-config2.sh)
What: Similar structure to the app agent, but the template reads the issuing CA (pki-test/cert/ca) and the post-hook writes the chain to /home/<proxyuser>/nginx/ca/current-ca-chain.pem.
Why: Keeps chain refresh separate from leaf issuance (principle of least privilege) and uses the same minimal systemd unit pattern to avoid capability drop failures.
9) Switch AIA/CRL URLs to HTTPS
What: After Vault serves HTTPS successfully, the PKI mount URLs are updated to https://127.0.0.1:22300/v1/<mount>/{ca,crl}.
Why: Ensures downstream validation paths reference TLS-protected endpoints.
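The update itself is a single write to the mount's config/urls endpoint. A small sketch, with a hypothetical wrapper function name:

```shell
#!/bin/sh
# Sketch only: the wrapper name is hypothetical; the endpoint and
# parameter names are those of Vault's PKI config/urls API.

# set_pki_urls <mount> [vault_addr]
set_pki_urls() {
  mount="$1"
  addr="${2:-https://127.0.0.1:22300}"
  vault write "$mount/config/urls" \
    issuing_certificates="$addr/v1/$mount/ca" \
    crl_distribution_points="$addr/v1/$mount/crl"
}
```

For example, set_pki_urls pki-test. These URLs are embedded at issuance time, so certificates issued before the switch keep the old values until they are rotated.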
Timeline / Evolution
- Root CA Creation: Succeeded; an early verification path typo was fixed to point to the new offline-root layout.
- Intermediate Setup: Enabled pki-test, CSR generated and signed by the offline Root. AIA warnings noted; URLs later switched to HTTPS.
- Vault Server Cert Issuance: Files generated under /home/vault/tls-test. Initial container start failed due to key permissions and chown attempts; volumes and modes were corrected.
- HTTPS & Unseal: Health check confirmed HTTPS; unseal initially failed due to trust; resolved by distributing Vault’s CA to operators/agents and/or configuring ca_cert.
- Agent Rollout (App): First runs hit 218/CAPABILITIES due to hardening. Switching to a minimal unit and a file token sink fixed startup and /dev/null temp-file errors.
- Agent Rollout (Proxy): Same capability issue; resolved with minimal unit. A manual run confirmed .ca.json rendering and CA chain write. Label-based reload ran (skipped if no labeled containers).
- Stabilization: Uniform reload label (tls=true), idempotent CA distribution, and explicit HTTPS trust in agent configs removed intermittent errors.
Current State
- Vault over HTTPS: Running on https://127.0.0.1:22300 using server.key/crt and fullchain.crt.
- PKI Mounts: pki-test active. AIA/CRL URLs configured to HTTPS.
- App Leaf Agent (e.g., user nctest):
  - Service active (minimal unit), authenticates via AppRole, writes ~/tls/<app>.key and <app>.fullchain.pem, and attempts labeled container reloads.
- Proxy CA Agent (e.g., user proxytest):
  - Service active (minimal unit), writes /home/proxytest/nginx/ca/current-ca-chain.pem.
- Post-hook: Unified and environment-agnostic; uses container label tls=true.
- CA Distribution: Automated to /home/<user>/vault/ca/ca.pem.
Pending/Partial:
- Ensure all relevant containers are labeled tls=true and reachable by the agent’s user for reloads.
- Confirm SELinux/AppArmor contexts if enabled (not covered).
- Review userns keep-id versus rootful container policy for production.
Future Improvements
- Hardened Units (Optional): Reintroduce systemd hardening gradually with tested capability settings to avoid 218/CAPABILITIES.
- Policy Minimization: Restrict PKI roles per app (narrower domains/SANs) and split read/issue permissions more strictly.
- Health & Verification: Add periodic openssl verify and curl --cacert checks for Vault and proxy endpoints; alert on failures.
- Renewal Windows: Tune agent templates for proactive rotation and add metrics on certificate lifetimes.
- Configuration Management: Migrate scripts to Ansible roles; template all environment values from a single inventory.
- Observability: Centralize agent logs and add dashboards for issuance counts, errors, and reload outcomes.
- Production Readiness: Document the preferred container user strategy, finalize SELinux labels, and define an incident playbook.
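The Health & Verification item above could start from something as small as the sketch below. The function names, the seven-day default threshold, and the return codes are assumptions; nothing like this exists in the current setup.

```shell
#!/bin/sh
# Sketch only: function names, threshold and return codes are assumptions.

# check_cert <ca_chain.pem> <cert.pem> [min_seconds_left]
# Returns 0 if the cert chains to the CA file and is not close to expiry.
check_cert() {
  chain="$1"; cert="$2"; min_left="${3:-604800}"

  openssl verify -CAfile "$chain" "$cert" >/dev/null 2>&1 \
    || { echo "FAIL chain: $cert"; return 1; }

  # -checkend exits non-zero if the cert expires within min_left seconds.
  openssl x509 -in "$cert" -noout -checkend "$min_left" >/dev/null \
    || { echo "FAIL expiring: $cert"; return 2; }

  echo "OK: $cert"
}

# check_endpoint <ca.pem> <url>
# Succeeds only if the TLS handshake validates against ca.pem and the
# HTTP status indicates success.
check_endpoint() {
  curl --silent --fail --cacert "$1" --output /dev/null "$2"
}
```

A systemd user timer could run check_cert against ~/tls/<app>.fullchain.pem and check_endpoint against https://127.0.0.1:22300/v1/sys/health, raising an alert on any non-zero exit.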
Conclusion
The new pipeline provides a secure, repeatable TLS foundation: an offline Root, a managed Intermediate in Vault, HTTPS-served Vault, and automatic certificate/chain management via systemd user-level agents. By standardizing on a single reload label and minimizing per-app differences, operations are simpler and less error-prone. The approach resolves prior issues with permissions, capabilities, and HTTPS trust, and sets a clear path for further hardening and automation at scale.