SSO Security Rebuild
Rebuilt a critical authentication system serving 250,000 users in 6 developer weeks after a security audit revealed vulnerabilities requiring immediate remediation.
Executive Summary
I identified critical security risks in NAIS's legacy SSO system—the authentication gateway for 250,000 users across 1,700+ member schools. Knowing that security investments compete with feature work for funding, I commissioned a third-party security assessment to build the business case. The audit confirmed my concerns: multiple critical and high-severity vulnerabilities stemming from fundamental architectural flaws. With findings in hand, I secured contingency funding and led a complete rebuild using Phoenix LiveView. I also drove communications with all external vendor development partners to coordinate changes and testing. The project shipped in 6 developer weeks with zero downtime and zero disruption to client applications or vendors. A novel Cloudflare Worker solution enabled production A/B testing at sso.nais.org, allowing real-world validation with instant rollback capability.
The Problem
The SSO system predated my arrival and ran on AngularJS—a framework that reached end-of-life in December 2021. I had flagged it as a security liability, but leadership required external validation before committing resources to a rebuild. I commissioned a third-party security firm to conduct a comprehensive assessment. The audit confirmed my concerns: multiple critical and high-severity vulnerabilities stemming from fundamental architectural issues. The findings created clear organizational consensus: the system needed to be rebuilt from the ground up.
Options Considered
Patch Legacy System
Estimated 8-12 weeks in total to patch the vulnerabilities one at a time. High regression risk on an EOL framework, with AngularJS expertise increasingly scarce. We would still need another migration within 2 years. Rejected: addresses symptoms, not root cause.
Incremental Migration
Gradually replace components while maintaining the legacy system. Lower immediate risk but extends the exposure window and doubles maintenance burden. Rejected—prolongs vulnerability exposure.
Complete Rebuild (Chosen)
Full rebuild using Phoenix LiveView with a hard constraint: must be a drop-in replacement requiring zero changes from client applications or third-party vendors. Modern framework with built-in security patterns. Eliminates technical debt and enables future OAuth 2.1 upgrade. Selected for architectural soundness and faster timeline.
The Solution
I led the ground-up rebuild using Phoenix LiveView, a framework with security-first design patterns, built-in CSRF protection, and parameterized queries by default. Rather than patch individual vulnerabilities, the new architecture eliminates entire classes of security issues. A critical constraint shaped the project: the new SSO had to be a drop-in replacement requiring zero changes from client applications or third-party vendors. I drove communications with all external vendor development partners to specify the integration requirements, coordinate testing windows, and validate compatibility before cutover.
The key technical innovation was the deployment strategy. To achieve zero-downtime cutover, we designed a Cloudflare Worker solution:
- Deploy the new SSO alongside the legacy system on separate infrastructure
- A Cloudflare Worker at sso.nais.org reads a browser cookie on each request
- Route users to the old or new system based on the cookie value
- Test with internal users first, then gradually expand to the general population
- Maintain instant rollback capability throughout—if anything breaks, flip the cookie logic
This approach enabled real-world production validation with tightly bounded risk. We could test with actual user traffic and still route everyone back to the legacy system instantly.
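The routing decision described above can be sketched as a few pure functions. This is a minimal illustration, not the production Worker: the cookie name `sso_backend`, its values, the origin hostnames, and the `forceLegacy` rollback switch are all assumptions made for the example.

```typescript
// Hypothetical names throughout: the cookie name, its values, and the origin
// hostnames are illustrative assumptions, not the production configuration.
type Backend = "legacy" | "rebuilt";

const ROUTING_COOKIE = "sso_backend";

const ORIGINS: Record<Backend, string> = {
  legacy: "https://legacy-sso.example.internal",
  rebuilt: "https://rebuilt-sso.example.internal",
};

// Parse a Cookie header ("a=1; b=2") into a name -> value map.
function parseCookies(header: string | null): Map<string, string> {
  const cookies = new Map<string, string>();
  if (!header) return cookies;
  for (const part of header.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip fragments without a value
    cookies.set(part.slice(0, eq).trim(), part.slice(eq + 1).trim());
  }
  return cookies;
}

// Decide which system serves this request. Only an explicit opt-in cookie
// reaches the rebuilt system; everything else falls through to legacy.
// forceLegacy is the (assumed) rollback switch that overrides the cookie.
function pickBackend(cookieHeader: string | null, forceLegacy = false): Backend {
  if (forceLegacy) return "legacy";
  const choice = parseCookies(cookieHeader).get(ROUTING_COOKIE);
  return choice === "rebuilt" ? "rebuilt" : "legacy";
}

// Map the public URL onto the chosen origin, preserving path and query string.
function upstreamUrl(publicUrl: string, backend: Backend): string {
  const incoming = new URL(publicUrl);
  const target = new URL(ORIGINS[backend]);
  target.pathname = incoming.pathname;
  target.search = incoming.search;
  return target.toString();
}
```

In a Worker's fetch handler, the URL returned by `upstreamUrl` would be fetched on the user's behalf and the response passed back unchanged. Defaulting to legacy means a missing or malformed cookie never lands a user on the system under test.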
Architecture Benefits
- Parameterized queries: Ecto, the database library used with Phoenix, binds values as query parameters by default, so injection is prevented by framework design rather than by careful coding.
- Cryptographic token binding: Session and reset tokens are bound to specific users with configurable expiration, eliminating token-reuse attacks.
- Router-level authentication: Authentication middleware at the router level means endpoints cannot be accidentally exposed. Security is structural, not per-endpoint.
- Allowlist validation: Only approved redirect domains are permitted, preventing token leakage through open redirects.
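As an illustration of the allowlist item above, a redirect check might look like the following. It is a sketch in TypeScript rather than the application's Elixir code, and the host names in the allowlist are placeholders, since the real list of approved domains is not given here.

```typescript
// Hypothetical allowlist; the real set of approved domains is not part of this write-up.
const ALLOWED_REDIRECT_HOSTS = new Set(["portal.example.org", "learn.example.org"]);

// Returns true only for absolute HTTPS URLs whose host is exactly on the allowlist.
function isAllowedRedirect(target: string): boolean {
  let url: URL;
  try {
    url = new URL(target); // throws on relative or malformed input
  } catch {
    return false;
  }
  // Compare the parsed hostname exactly. Substring or suffix checks on the raw
  // string can be fooled by lookalike hosts or by userinfo tricks such as
  // "https://portal.example.org@evil.test/".
  return url.protocol === "https:" && ALLOWED_REDIRECT_HOSTS.has(url.hostname);
}
```

Parsing first and comparing the hostname exactly is the design choice that matters: the decision is made on what the browser will actually navigate to, not on how the string looks.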
Implementation
Discovery (Q2 2025)
Third-party security assessment documenting findings and remediation priorities
Business Case (Q3 2025)
Presented findings to Chief Counsel and CFO; secured contingency funding based on breach cost avoidance
Development (Q4 2025)
6-week sprint building Phoenix LiveView application; Cloudflare Worker testing infrastructure; feature parity validation
Launch (January 2026)
Production deployment with zero downtime; zero client disruption; all vulnerabilities remediated
Impact & ROI
Risks & Mitigation
| Risk | Mitigation |
|---|---|
| Service disruption during migration | Cloudflare Worker A/B testing with instant rollback capability |
| Vendor integration breakage | Feature-complete parity testing before cutover; drop-in replacement constraint |
| Vendor/client integration costs | Drop-in replacement requiring zero changes from vendors or client apps—avoided potential integration fees entirely |
| Timeline overrun | Fixed scope—OAuth 2.1 upgrade deferred to separate phase |
| Regression in new system | Modern testing framework; router-level security guarantees |
Stakeholders
Sponsor:
Chief Counsel and CFO (approved contingency funding based on breach risk)
Users:
250,000 platform users across 1,700+ member schools
External Partners:
Third-party vendor development teams (coordinated integration testing and cutover)
Contributors:
Senior Elixir developer; third-party security firm
Key Features
Parameterized Queries
Queries are built through Ecto's query builder, which binds user input as parameters instead of interpolating it into query strings. This closes off SQL and SOQL injection by design rather than by careful coding: the query construction patterns that could be exploited are not the default path.
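The principle can be shown in a few lines. The sketch below is written in TypeScript and is not Ecto's API; the `sql` tag and `ParameterizedQuery` type are hypothetical names for the example. It shows why a builder that keeps query text and values separate leaves no place for user input to be read as SQL.

```typescript
// A query whose text and values travel separately to the database driver.
interface ParameterizedQuery {
  text: string;      // SQL with $1, $2, ... placeholders
  values: unknown[]; // user-supplied values, never spliced into the text
}

// Hypothetical tagged-template builder: each interpolated value is replaced
// by a numbered placeholder and collected into the values array.
function sql(strings: TemplateStringsArray, ...values: unknown[]): ParameterizedQuery {
  // reduce without an initial value starts from strings[0]; i then runs 1, 2, ...
  const text = strings.reduce((acc, chunk, i) => acc + "$" + i + chunk);
  return { text, values };
}

// A classic injection payload stays inert: it is only ever a value.
const email = "x' OR '1'='1";
const query = sql`SELECT id FROM users WHERE email = ${email}`;
// query.text is "SELECT id FROM users WHERE email = $1"
```

Because the payload never becomes part of `query.text`, the database treats it as a literal string to compare against, whatever characters it contains.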
Router-Level Authentication
Authentication middleware applied at the router level means endpoints cannot be accidentally exposed. New routes are secure by default; developers must explicitly opt out of authentication rather than remembering to opt in.
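The secure-by-default idea can be sketched with a toy router. This is a TypeScript illustration of the pattern, not Phoenix's router; the `Router` class, the `skipAuth` option, and the routes are hypothetical.

```typescript
interface RequestContext {
  path: string;
  userId: string | null; // null when the request carries no valid session
}

type Handler = (ctx: RequestContext) => string;

class Router {
  private routes = new Map<string, { handler: Handler; skipAuth: boolean }>();

  // Routes are protected unless the caller explicitly opts out with skipAuth.
  add(path: string, handler: Handler, options: { skipAuth?: boolean } = {}): void {
    this.routes.set(path, { handler, skipAuth: options.skipAuth === true });
  }

  // The authentication check runs here, ahead of every handler, so a handler
  // cannot be reached without passing it.
  dispatch(ctx: RequestContext): { status: number; body: string } {
    const route = this.routes.get(ctx.path);
    if (!route) return { status: 404, body: "not found" };
    if (!route.skipAuth && ctx.userId === null) {
      return { status: 401, body: "unauthorized" };
    }
    return { status: 200, body: route.handler(ctx) };
  }
}

const router = new Router();
router.add("/login", () => "login form", { skipAuth: true }); // explicit opt-out
router.add("/account", (ctx) => `account for ${ctx.userId}`); // protected by default
```

A developer who forgets the options argument gets a protected route, which is the failure mode this design is meant to guarantee.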
Cloudflare Worker Routing
Production A/B testing between old and new systems via cookie-based routing at sso.nais.org. This pattern enabled real-world validation with instant rollback and is applicable to any high-stakes system cutover.
Cryptographic Token Binding
Password reset and session tokens are cryptographically bound to specific users with configurable expiration. This eliminates an entire class of token-reuse attacks that plagued the legacy system.
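One common way to bind a token to a user is to sign the user ID and expiry together with an HMAC. The sketch below shows that technique in TypeScript; it is not necessarily the mechanism used in the rebuild (Phoenix ships `Phoenix.Token` for this purpose), and the token format, secret, and function names are assumptions. It also assumes user IDs contain no `.` characters.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical secret for the sketch; a real deployment would load it from secure config.
const SECRET = "demo-secret-not-for-production";

function sign(payload: string): string {
  return createHmac("sha256", SECRET).update(payload).digest("hex");
}

// Token format (an assumption for this sketch): "<userId>.<expiresAtMs>.<hmac>".
function issueToken(userId: string, ttlMs: number, now: number = Date.now()): string {
  const payload = `${userId}.${now + ttlMs}`;
  return `${payload}.${sign(payload)}`;
}

// Valid only if the signature matches, the token belongs to expectedUserId,
// and the expiry time has not passed.
function verifyToken(token: string, expectedUserId: string, now: number = Date.now()): boolean {
  const parts = token.split(".");
  if (parts.length !== 3) return false;
  const [userId, expiresAt, mac] = parts;
  const expected = Buffer.from(sign(`${userId}.${expiresAt}`), "hex");
  const given = Buffer.from(mac, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return false;
  return userId === expectedUserId && Number(expiresAt) > now;
}
```

Because the user ID is inside the signed payload, a token issued to one user fails verification for any other, and editing the ID invalidates the signature. Making a token strictly single-use would additionally need server-side state, which this sketch leaves out.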
Vendor Coordination
The drop-in replacement constraint required close coordination with external development partners. I drove communications specifying integration requirements, scheduled testing windows, and validated compatibility before cutover.
Lessons Learned
What worked well:
Commissioning an external security assessment created organizational alignment that internal advocacy alone couldn’t achieve. Having a third party document the risks gave leadership the validation they needed to approve contingency funding.
The drop-in replacement constraint kept scope tight and the timeline achievable. By committing to zero changes for client applications and vendors, we avoided scope creep and delivered in six weeks.
The Cloudflare Worker approach took most of the risk out of deployment. I would use this pattern again for any high-stakes authentication migration: the ability to test in production with instant rollback changed the risk calculus completely.
Phoenix LiveView’s security defaults meant fewer decisions to get wrong. The framework makes entire vulnerability classes architecturally impossible rather than relying on developers to remember best practices.
What I’d do differently:
I would have commissioned the security audit earlier. The vulnerabilities had existed for years, and the rebuild could have happened sooner if I’d pushed for external validation earlier in my tenure.
I would document the Cloudflare Worker routing pattern as a reusable playbook. It’s applicable to any high-stakes system migration, not just authentication, and deserves to be a standard tool in the infrastructure toolkit.
Next phase: OAuth 2.1 upgrade, which will require client application and vendor coordination. The Phoenix foundation makes the technical work straightforward.