End-to-End PKI and TLS Automation
Abstract
This post describes the pipeline I ended up with for certificates and trust distribution: an offline root, environment-specific intermediates in Vault, HTTPS on Vault itself, and user-level agents that rotate app leaf certificates or proxy CA chains.
Background & Context
The goal was to standardize TLS across services with minimal per-app customization. Requirements included:
- A secure Root CA kept offline.
- An Intermediate CA managed by Vault for day-to-day issuance.
- Vault served over HTTPS on localhost, with a proper server certificate and chain.
- Automatic issuance/rotation of leaf certs for apps and automatic refresh of the issuing CA chain for reverse proxies.
- A single, uniform reload mechanism for containers (Podman) using a reusable label (tls=true).
- Scripts and config that work consistently across test and prod environments without special-case paths or names.
This was necessary to reduce manual handling of certificates, tighten security guarantees, and make rollouts/reloads predictable and repeatable.
Implementation & Decisions
1) Offline Root CA (01_make_offline_root_ca.sh)
What: Creates an offline Root CA per environment at a dedicated offline-root path, producing root-ca.key (kept offline) and root-ca.pem (public).
Why: Keeping the Root offline limits blast radius. Only the public root is distributed or referenced elsewhere.
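A minimal sketch of what this step could look like with plain OpenSSL. The key type, lifetime, and subject name are assumptions for illustration and are not taken from the actual script; only the file names root-ca.key and root-ca.pem come from the description above.

```shell
#!/usr/bin/env bash
# Sketch: create an offline root CA in a dedicated directory.
# Key type (P-384), lifetime (10 years), and subject are illustrative assumptions.
set -euo pipefail

# The function body is a subshell (parentheses) so the umask change stays local.
make_offline_root_ca() (
  dir="$1"
  subject="${2:-/CN=Example Offline Root CA}"   # hypothetical subject

  umask 077
  mkdir -p "$dir"

  # Private key: never leaves the offline host.
  openssl ecparam -name secp384r1 -genkey -noout -out "$dir/root-ca.key"

  # Self-signed root certificate: the only artifact that gets distributed.
  # CA extensions usually come from the v3_ca section of the default
  # openssl.cnf; confirm with `openssl x509 -noout -text`.
  openssl req -x509 -new -key "$dir/root-ca.key" -sha256 -days 3650 \
    -subj "$subject" -out "$dir/root-ca.pem"

  chmod 0600 "$dir/root-ca.key"
  chmod 0644 "$dir/root-ca.pem"
)
```

Calling `make_offline_root_ca ./offline-root` on the offline machine produces both files; only root-ca.pem should be copied elsewhere.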
2) Intermediate PKI in Vault (02_intermediate_in_vault_sign_with_root.sh)
What: Enables the environment PKI mount, generates a CSR inside Vault, signs it with the offline root via OpenSSL, then installs the signed intermediate certificate back into Vault. It also configures AIA and CRL URLs and copies the public root into the Vault TLS directory.
Why: The Intermediate performs all operational issuance. AIA/CRL URLs enable validation. Copying the public root makes later server issuance and trust bootstrapping easier.
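The CSR round trip between Vault and the offline root can be sketched roughly as follows. The mount name, common name, lifetime, and extension values are assumptions, not the values used in the real script; the Vault paths (intermediate/generate/internal, intermediate/set-signed, config/urls) are the standard PKI secrets engine endpoints.

```shell
#!/usr/bin/env bash
# Sketch: generate a CSR in Vault, sign it with the offline root, install it.
# Mount name, common name, and lifetimes are illustrative assumptions.
set -euo pipefail

setup_intermediate() {
  local mount="$1"       # hypothetical mount name, e.g. pki_test
  local root_dir="$2"    # directory holding root-ca.key and root-ca.pem
  local vault_url="$3"   # base URL published in AIA/CRL fields
  local work
  work="$(mktemp -d)"

  # 1. Enable the PKI mount unless it already exists.
  if ! vault secrets list | grep -q "^${mount}/"; then
    vault secrets enable -path="$mount" pki
  fi

  # 2. Generate key and CSR inside Vault; the private key stays in Vault.
  vault write -field=csr "$mount/intermediate/generate/internal" \
    common_name="Example Intermediate CA" > "$work/intermediate.csr"

  # 3. Sign the CSR with the offline root, marking the result as a CA.
  printf '%s\n' \
    'basicConstraints=critical,CA:TRUE,pathlen:0' \
    'keyUsage=critical,keyCertSign,cRLSign' > "$work/ca.ext"
  openssl x509 -req -in "$work/intermediate.csr" \
    -CA "$root_dir/root-ca.pem" -CAkey "$root_dir/root-ca.key" \
    -CAcreateserial -days 1825 -sha256 \
    -extfile "$work/ca.ext" -out "$work/intermediate.pem"

  # 4. Install the signed intermediate back into Vault.
  vault write "$mount/intermediate/set-signed" \
    certificate=@"$work/intermediate.pem"

  # 5. Publish AIA and CRL URLs on the mount.
  vault write "$mount/config/urls" \
    issuing_certificates="$vault_url/v1/$mount/ca" \
    crl_distribution_points="$vault_url/v1/$mount/crl"
}
```

In practice step 3 runs on the offline host, so the CSR and the signed certificate are carried across by hand; the single function here only shows the order of operations.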
3) Vault Server Certificate (03_issue_vault_server_cert.sh)
What: Creates or reuses a PKI role and issues the Vault server certificate with the DNS and IP SANs the service actually needs. It writes:
- a private key
- a leaf certificate
- a full chain bundle
- a CA chain bundle
Why: Proper file modes and split artifacts match typical TLS consumers and simplify Docker/Podman mounts.
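As a rough illustration, one issuance call can be split into the four artifacts like this. It assumes jq is available; the role name, output file names other than server.key, and the SAN values passed in are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: issue one certificate and split the response into separate files.
# Assumes jq is installed; file names other than server.key are hypothetical.
set -euo pipefail

issue_server_cert() {
  local mount="$1" role="$2" out_dir="$3" cn="$4" alt_names="$5" ip_sans="$6"
  local json

  # One call returns key, leaf, and chain together as JSON.
  json="$(vault write -format=json "$mount/issue/$role" \
    common_name="$cn" alt_names="$alt_names" ip_sans="$ip_sans")"

  mkdir -p "$out_dir"

  # Create the key file with a restrictive mode before writing into it.
  install -m 0600 /dev/null "$out_dir/server.key"
  jq -r '.data.private_key' <<<"$json" > "$out_dir/server.key"

  # Leaf certificate.
  jq -r '.data.certificate' <<<"$json" > "$out_dir/server.crt"

  # CA chain bundle; fall back to the issuing CA if ca_chain is absent.
  jq -r '.data.ca_chain // [.data.issuing_ca] | join("\n")' \
    <<<"$json" > "$out_dir/ca-chain.pem"

  # Full chain bundle: leaf first, then the CA chain.
  cat "$out_dir/server.crt" "$out_dir/ca-chain.pem" > "$out_dir/fullchain.pem"

  chmod 0644 "$out_dir/server.crt" "$out_dir/ca-chain.pem" "$out_dir/fullchain.pem"
}
```

Keeping the key in its own 0600 file means the public bundles can be mounted read-only into containers without exposing it.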
4) Vault over HTTPS (Podman Compose)
What: The Vault container serves HTTPS on a local listener with mounted configuration, state, and TLS material.
Why & Lessons:
- Initial failures stemmed from permission errors (server.key access) and attempts by the container to chown read-only mounts. We eliminated in-container chown expectations and ensured host-side ownership/modes were correct.
- Running with userns keep-id triggered capability errors in some cases; we favored a configuration that does not rely on container-side chown or file capabilities.
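To make the host-side approach concrete, here is a small sketch that writes a Vault TLS listener stanza and checks the mounted TLS files before the container starts. The listen address, file names, and the "no access for others" rule are assumptions; the listener parameters (address, tls_cert_file, tls_key_file) are standard Vault configuration keys.

```shell
#!/usr/bin/env bash
# Sketch: write the HTTPS listener stanza and pre-check TLS files on the host.
# Address, paths, and the mode policy are illustrative assumptions.
set -euo pipefail

write_vault_listener_config() {
  local out_file="$1" tls_dir="$2" addr="${3:-127.0.0.1:8200}"
  # tls_dir is the path as seen inside the container.
  cat > "$out_file" <<EOF
listener "tcp" {
  address       = "$addr"
  tls_cert_file = "$tls_dir/fullchain.pem"
  tls_key_file  = "$tls_dir/server.key"
}
EOF
}

# Verify on the host that the files are readable and the key is not exposed
# to "other", so nothing inside the container needs to chown or chmod.
check_tls_material() {
  local dir="$1" f mode rc=0
  for f in server.key fullchain.pem; do
    if [ ! -r "$dir/$f" ]; then
      echo "not readable: $dir/$f" >&2
      rc=1
    fi
  done
  if [ -e "$dir/server.key" ]; then
    mode="$(stat -c '%a' "$dir/server.key")"
    case "$mode" in
      *0) ;;   # last octal digit 0: no access for others
      *) echo "server.key mode $mode allows access for others" >&2; rc=1 ;;
    esac
  fi
  return "$rc"
}
```

Running the check before `podman-compose up` surfaces permission problems as a clear host-side message instead of a container start failure.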
5) Post-Hook for Agents (scripts/vault-agent-post.sh)
What: A minimal post-hook reads a JSON payload:
- If it contains a leaf (private_key), it writes the key and full chain into the app TLS directory.
- If it contains only a chain, it writes the CA chain to the proxy path.
Then it reloads Podman containers labeled tls=true by exec’ing inside them (e.g., nginx -s reload).
Why: A uniform reload path avoids per-app logic. The label decouples the agent from specific container names.
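The two halves of the hook can be sketched as below. This is an approximation, not the actual vault-agent-post.sh: it assumes jq is installed, and the payload field names other than private_key (certificate, ca_chain, issuing_ca) and the output file names are guesses at the rendered JSON layout.

```shell
#!/usr/bin/env bash
# Sketch of a post-hook: write rendered material, then reload labeled containers.
# Field names other than private_key and all file names are assumptions.
set -euo pipefail

write_payload() {
  local payload="$1" app_tls_dir="$2" proxy_ca_file="$3"

  if jq -e '.private_key' "$payload" >/dev/null; then
    # Leaf payload: key plus full chain for the app.
    mkdir -p "$app_tls_dir"
    install -m 0600 /dev/null "$app_tls_dir/tls.key"
    jq -r '.private_key' "$payload" > "$app_tls_dir/tls.key"
    jq -r '[.certificate] + (.ca_chain // [.issuing_ca]) | join("\n")' \
      "$payload" > "$app_tls_dir/fullchain.pem"
  else
    # Chain-only payload: CA chain for the proxy.
    mkdir -p "$(dirname "$proxy_ca_file")"
    jq -r '.certificate' "$payload" > "$proxy_ca_file"
  fi
}

reload_labeled_containers() {
  local label="${1:-tls=true}" name
  # List running containers carrying the label, then reload each one.
  podman ps --filter "label=$label" --format '{{.Names}}' |
    while read -r name; do
      [ -n "$name" ] || continue
      podman exec "$name" nginx -s reload </dev/null ||
        echo "reload failed for $name" >&2
    done
}
```

If no container carries the label, the loop body never runs and the hook exits cleanly, which matches the "skipped if no labeled containers" behavior noted in the timeline.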
6) CA Distribution to Users (distribute_ca_to_agents.sh)
What: Copies a trusted CA file to each service user's trust path with the correct ownership and mode.
Why: Agents talking to Vault over HTTPS must trust Vault’s chain. This removes ad-hoc manual copying and fixes “unknown authority” errors.
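An idempotent copy along these lines would cover the behavior described here. The destination path and the 0644 mode are assumptions; the real script may choose differently.

```shell
#!/usr/bin/env bash
# Sketch: idempotently install a CA file into one service user's trust path.
# Destination path and mode are illustrative assumptions.
set -euo pipefail

install_ca_for_user() {
  local ca_file="$1" user="$2" dest="$3"

  # Copy only when the content differs or the destination is missing.
  if ! cmp -s "$ca_file" "$dest"; then
    install -D -m 0644 "$ca_file" "$dest"
  fi

  # Always enforce the mode, and the ownership when running as root.
  chmod 0644 "$dest"
  if [ "$(id -u)" -eq 0 ]; then
    chown "$user:$(id -gn "$user")" "$dest"
  fi
}
```

A wrapper would loop over the service users and call `install_ca_for_user` once per user; rerunning it changes nothing when the CA file is already current.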
7) App Leaf Agent (setup-vault-agent-app-config2.sh)
What: Per app:
- Creates minimal policies and an AppRole with a periodic token.
- Writes vault-agent.hcl and a template that issues a leaf from the Intermediate.
- Sets the Vault HTTPS address and CA trust file in the agent config.
- Uses a file sink (mode = 0600) for the token.
- Installs a minimal systemd user service (no hardening flags) to avoid 218/CAPABILITIES issues.
- Post-hook writes files to ~/tls/ and triggers the label-based reload.
Why: Minimal units proved robust on hosts where systemd hardening and file capabilities caused agent startup failures. A file sink avoids /dev/null temp-file errors and ensures integer mode is accepted.
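The generated files might look roughly like what this sketch writes. All paths, the role_id/secret_id file names, the template and unit file names, the vault binary location, and the way the hook receives the rendered file are assumptions; the stanza and parameter names (vault, auto_auth, method "approle", sink "file", template) are standard Vault Agent configuration.

```shell
#!/usr/bin/env bash
# Sketch: generate an agent config, a leaf template, and a minimal user unit.
# All paths and file names except vault-agent.hcl are illustrative assumptions.
set -euo pipefail

write_agent_files() {
  local dir="$1" vault_addr="$2" ca_file="$3" issue_path="$4" cn="$5" hook="$6"
  mkdir -p "$dir"

  # Template: issue a leaf and render the whole response as JSON for the hook.
  cat > "$dir/leaf.ctmpl" <<EOF
{{ with secret "$issue_path" "common_name=$cn" }}{{ .Data | toJSON }}{{ end }}
EOF

  cat > "$dir/vault-agent.hcl" <<EOF
vault {
  address = "$vault_addr"
  ca_cert = "$ca_file"
}

auto_auth {
  method "approle" {
    config = {
      role_id_file_path                   = "$dir/role_id"
      secret_id_file_path                 = "$dir/secret_id"
      remove_secret_id_file_after_reading = false
    }
  }

  sink "file" {
    config = {
      path = "$dir/token"
      mode = 0600
    }
  }
}

template {
  source      = "$dir/leaf.ctmpl"
  destination = "$dir/leaf.json"
  perms       = "0600"
  command     = "$hook $dir/leaf.json"
}
EOF

  # Minimal unit: deliberately no hardening or capability directives.
  cat > "$dir/vault-agent.service" <<EOF
[Unit]
Description=Vault Agent (app leaf certificates)

[Service]
ExecStart=/usr/bin/vault agent -config=$dir/vault-agent.hcl
Restart=on-failure

[Install]
WantedBy=default.target
EOF
}
```

The unit file would then be copied to ~/.config/systemd/user/ and enabled with `systemctl --user enable --now vault-agent.service`.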
8) Proxy CA Agent (setup-vault-agent-proxy-config2.sh)
What: Similar structure to the app agent, but the template reads the issuing CA and the post-hook writes the rendered chain into the proxy trust path.
Why: Keeps chain refresh separate from leaf issuance (principle of least privilege) and uses the same minimal systemd unit pattern to avoid capability drop failures.
9) Switch AIA/CRL URLs to HTTPS
What: After Vault serves HTTPS successfully, the PKI mount URLs are updated to their HTTPS form for CA and CRL publishing.
Why: Ensures downstream validation paths reference TLS-protected endpoints.
Timeline / Evolution
- Root CA Creation: Succeeded; an early verification path typo was fixed to point to the new offline-root layout.
- Intermediate Setup: Enabled the environment PKI mount; the CSR was generated in Vault and signed by the offline Root. AIA warnings were noted, and the URLs were later switched to HTTPS.
- Vault Server Cert Issuance: Files generated under the environment TLS directory. Initial container start failed due to key permissions and chown attempts; volumes and modes were corrected.
- HTTPS & Unseal: Health check confirmed HTTPS; unseal initially failed due to trust; resolved by distributing Vault’s CA to operators/agents and/or configuring ca_cert.
- Agent Rollout (App): First runs hit 218/CAPABILITIES due to hardening. Switching to a minimal unit and a file token sink fixed startup and /dev/null temp-file errors.
- Agent Rollout (Proxy): Same capability issue; resolved with minimal unit. A manual run confirmed .ca.json rendering and CA chain write. Label-based reload ran (skipped if no labeled containers).
- Stabilization: Uniform reload label (tls=true), idempotent CA distribution, and explicit HTTPS trust in agent configs removed intermittent errors.
Current State
- Vault over HTTPS: running with a real server certificate and full chain.
- PKI Mounts: the environment PKI mount is active and publishes HTTPS URLs.
- App Leaf Agent (for one workload):
- Service active (minimal unit), authenticates via AppRole, writes the rendered leaf material, and attempts labeled container reloads.
- Proxy CA Agent:
- Service active (minimal unit), writes the NGINX CA chain under the proxy user’s home.
- Post-hook: Unified and environment-agnostic; uses container label tls=true.
- CA Distribution: automated per service user.
Pending/Partial:
- Ensure all relevant containers are labeled tls=true and reachable by the agent’s user for reloads.
- Confirm SELinux/AppArmor contexts if enabled (not covered).
- Review userns keep-id versus rootful container policy for production.
Future Improvements
- Hardened Units (Optional): Reintroduce systemd hardening gradually with tested capability settings to avoid 218/CAPABILITIES.
- Policy Minimization: Restrict PKI roles per app (narrower domains/SANs) and split read/issue permissions more strictly.
- Health & Verification: Add periodic openssl verify and curl --cacert checks for Vault and proxy endpoints; alert on failures.
- Renewal Windows: Tune agent templates for proactive rotation and add metrics on certificate lifetimes.
- Configuration Management: Migrate scripts to Ansible roles; template all environment values from a single inventory.
- Observability: Centralize agent logs and add dashboards for issuance counts, errors, and reload outcomes.
- Production Readiness: Document the preferred container user strategy, finalize SELinux labels, and define an incident playbook.
Conclusion
The result is a repeatable TLS pipeline instead of ad hoc certificate handling. The main gain is not novelty. It is that issuance, trust distribution, and reload behavior now follow one operational model across services.