Protect Your Data with Time-Based Infrastructure Access


by Ethan Houston, Security Engineer @Lumos

The pager is buzzing. It’s early in the morning. The on-call engineer needs to access a customer instance to debug an issue. Security is a top priority for the company, so by default the engineer doesn’t have permission to access any customer instances. Unfortunately, this means the on-call engineer has to page and wake up a security engineer to get the permission they need, and the security engineer will then have to revoke that permission the next morning.

Access control is a constant battle between security and productivity. Overly permissive access is a risk, but so is not being able to troubleshoot issues in a timely manner. Managing permissions at scale is incredibly hard. This article will give you a glimpse into how Lumos manages permissions for our developers securely and with minimal friction.

The Principle of Least Privilege

Let’s picture an ideal world first. In this ideal world, an employee would have exactly the permissions they need at exactly the moment their job demands them. That’s what the Principle of Least Privilege (PoLP) is all about: systems, devices, and users (both human and machine) should be granted only the minimum level of access needed to perform their duties.

Not Enough Access

Anything less than minimum access prevents employees from getting their job done. You’ve probably experienced this yourself at some point: you needed to get something done, clicked a button, and were greeted with a “you lack sufficient permissions” error. Then you probably had to file a ticket with IT and wait a long time for a response. It’s annoying and it slows you down. Not being able to grant access quickly is also a business risk: our on-call engineer needs to be able to respond to customer incidents in a timely manner.

Too Much Access

However, anything more than minimum access is an inherent security risk, whether from a malicious employee, a compromised endpoint, or just an honest mistake. Recent breaches, such as those at Okta and Uber, demonstrate that attackers are becoming more adept at getting around MFA, which raises the stakes. Let’s go back to our on-call engineer from earlier. If they keep the elevated access and their account is compromised, the systems they use for their normal job duties will be at risk. So will every service they’ve acquired ad-hoc access to over time. It’s similar to why we wear seatbelts: a seatbelt doesn’t prevent an accident, but it reduces the risk of injury when one happens. Least privilege works the same way, which is why unnecessary access needs to be removed.

Why RBAC and Access Reviews Are Not Good Enough

If seatbelts were painful, would you want to wear them? When security measures stifle productivity, users inevitably dodge them, leading to unwanted consequences. Like a good seatbelt, top-tier security allows us to cruise at full speed with the reassurance we need. So when it comes down to choosing between security and productivity, it's not a game of either/or, but a promise of both.

Implementing the Principle of Least Privilege may sound straightforward, but it can be a tough nut to crack within an organization. Role-Based Access Control (RBAC) is a common tactic, but it doesn't always get the job done. Our friends at Segment can attest to that. They learned that automating the process with sophisticated scripts based on roles and teams wasn't a magic fix.

They faced hurdles such as managing team restructures (name changes, swaps, merges, or splits), handling users moving between teams, and addressing situations where teams needed temporary access to tools outside their usual sphere. The result was a network of users with access to sensitive roles and permissions they didn’t necessarily need. Their RBAC-only approach to managing persistent access just wasn’t scalable.

A one-time audit of privileged access will address the symptom, but not the disease. Not only will these audits become harder or impossible to perform as the company grows, but they do nothing to prevent the same situation from happening again. And again.

With hundreds of roles across a range of SaaS apps and cloud providers, each offering different access levels, managing access is a complex task that RBAC or regular access reviews alone cannot solve. What’s needed is a single place to manage fine-grained, short-lived system access. Here is how we built that at Lumos.

Use Case: Customer Instances at Lumos

From time to time, engineers or customer support will need to access customer instances to debug issues. We follow a default-deny approach: access to a given customer’s instance must be explicitly granted. We have an internal system that provisions and deprovisions this access. It worked, but we had to run it manually whenever an employee requested access over a Slack DM, and then we had to remember to revoke that access afterwards.

Authorization around internal tooling is a problem faced by companies of all sizes. We leverage Lumos to streamline and monitor access to our sensitive internal tooling.

Let’s break the problem down.

1. What are the requestable resources?

In our case, each customer instance is a requestable resource. This list of customers is dynamic and will change over time. We have an AWS Lambda function that runs daily and answers the question: “What is our current list of active customers?” The function then uses the Lumos API to create a “requestable permission” in Lumos for each customer.
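Below is a minimal Python sketch of the sync logic such a daily function might run. The customer source, the permission naming scheme, the `PermissionSpec` shape, and the `create_permission` callback are all hypothetical placeholders. The callback would wrap a call to the Lumos API, and nothing here reflects that API's actual schema. The sketch also assumes that customers who already have a permission are skipped, so that running it every day is idempotent.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PermissionSpec:
    """Hypothetical payload for one requestable permission (one per customer)."""
    customer_id: str
    name: str


def plan_new_permissions(active: Iterable[str], existing: Iterable[str]) -> list[PermissionSpec]:
    """Build specs for active customers that don't have a requestable permission yet."""
    already = set(existing)
    return [
        PermissionSpec(customer_id=cid, name=f"Customer instance: {cid}")
        for cid in sorted(set(active))  # dedupe and sort for a deterministic order
        if cid not in already
    ]


def sync_permissions(
    fetch_active_customers: Callable[[], list[str]],
    fetch_existing: Callable[[], list[str]],
    create_permission: Callable[[PermissionSpec], None],
) -> int:
    """Create one requestable permission per new active customer; return how many were created."""
    specs = plan_new_permissions(fetch_active_customers(), fetch_existing())
    for spec in specs:
        # Placeholder: in practice this would call the Lumos API (endpoint not shown here).
        create_permission(spec)
    return len(specs)
```

A scheduled Lambda handler would call `sync_permissions` with real clients. Passing the data sources in as callables keeps the planning logic testable without network access.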

We can configure these permissions so that they can only be requested for either 2 or 4 hours.
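Lumos enforces the allowed durations through the permission's configuration. The Python sketch below only illustrates the rule itself, and `ALLOWED_DURATIONS` and `validate_duration` are hypothetical names.

```python
from datetime import timedelta

# Assumed policy mirroring the text: a request may only be for 2 or 4 hours.
ALLOWED_DURATIONS = (timedelta(hours=2), timedelta(hours=4))


def validate_duration(requested_hours: int) -> timedelta:
    """Return the requested duration if it is permitted, otherwise raise ValueError."""
    duration = timedelta(hours=requested_hours)
    if duration not in ALLOWED_DURATIONS:
        allowed = ", ".join(str(int(d.total_seconds() // 3600)) for d in ALLOWED_DURATIONS)
        raise ValueError(f"duration must be one of {allowed} hours, got {requested_hours}")
    return duration
```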


2. Who should be pre-approved?

As mentioned earlier, security is a constant balance between risk and business value. In our case, if a user is currently on call for an engineering team, they should automatically be granted time-based access to a specific customer’s instance.

Lumos integrates with PagerDuty, enabling you, for example, to automatically pre-approve employees who are currently on call.
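At its core, the pre-approval rule is a set-membership check. In this hypothetical sketch, `on_call_users` stands in for the current on-call roster. In practice that roster would come from PagerDuty, for example its `/oncalls` REST endpoint. The `AccessRequest` type and the decision labels are assumptions, not Lumos’s data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    requester: str     # e.g. the requester's email address
    customer_id: str   # the customer instance being requested


def decide(request: AccessRequest, on_call_users: set[str]) -> str:
    """Auto-approve requesters who are currently on call; route everyone else to a human approver."""
    if request.requester in on_call_users:
        return "auto_approved"
    return "needs_approval"
```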


3. Slack-based access

During a stressful incident, you don’t want the engineer to have to overthink anything. We make requesting access easy with our Slack-first flow, and users who prefer it can use the web UI instead. The end user can choose to get access for, say, 2 hours. However, users don’t always know how long they will need access, so they can instead request access for as long as they are on call, as defined in PagerDuty in our case. Lumos supports both time-based access and event-bound access.
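Here is a Python sketch of how an expiry could be derived in each of the two modes. It simplifies by assuming the end of the requester’s current on-call shift is known at grant time. A real event-bound grant would more likely be revoked when the on-call event actually ends, because schedules change. The function name and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def access_expiry(
    granted_at: datetime,
    hours: Optional[int] = None,
    on_call_ends_at: Optional[datetime] = None,
) -> datetime:
    """Expiry for a time-based grant (fixed number of hours) or an
    event-bound grant (until the requester's current on-call shift ends)."""
    if (hours is None) == (on_call_ends_at is None):
        raise ValueError("specify exactly one of hours or on_call_ends_at")
    if hours is not None:
        # Time-based: a fixed window starting at the grant time.
        return granted_at + timedelta(hours=hours)
    if on_call_ends_at <= granted_at:
        # Event-bound access only makes sense while the shift is still running.
        raise ValueError("requester is not currently on call")
    return on_call_ends_at
```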

4. Provisioning through webhooks

We have an Okta Workflow that calls our internal service to provision access to customer instances, and we were able to register this workflow as a webhook in Lumos.
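In our setup the webhook target is an Okta Workflow, but the same idea can be sketched as a small Python handler for an internal endpoint. The JSON fields (`action`, `user`, `customer_id`) and the hex-encoded HMAC-SHA256 signature over the raw body are assumptions for illustration. They are not Lumos’s documented webhook format. `provision` and `deprovision` are hypothetical stand-ins for calls to the internal service.

```python
import hashlib
import hmac
import json
from typing import Callable


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of a hex HMAC-SHA256 computed over the raw request body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


def handle_webhook(
    body: bytes,
    signature_hex: str,
    secret: bytes,
    provision: Callable[[str, str], None],
    deprovision: Callable[[str, str], None],
) -> int:
    """Dispatch a (hypothetical) access event and return an HTTP status code."""
    if not verify_signature(secret, body, signature_hex):
        return 401  # reject unsigned or tampered requests
    try:
        event = json.loads(body)
    except json.JSONDecodeError:
        return 400
    if not isinstance(event, dict):
        return 400
    user, customer = event.get("user"), event.get("customer_id")
    if not user or not customer:
        return 400
    action = event.get("action")
    if action == "provision":
        provision(user, customer)
    elif action == "deprovision":
        deprovision(user, customer)
    else:
        return 400
    return 200
```

Verifying the signature before parsing means unauthenticated input never reaches the provisioning logic.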

5. Automatic time-based deprovisioning

We don’t rely on a human to remember to deprovision access. Because access is limited to 2 or 4 hours, or to the time an engineer is on call, we can guarantee that no one holds long-lived access to any customer instance.
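Lumos performs this revocation for us. Conceptually, it amounts to a periodic sweep like the hypothetical Python sketch below. Here `revoke` stands in for the deprovisioning call and the `Grant` record is an assumed shape.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass(frozen=True)
class Grant:
    user: str
    customer_id: str
    expires_at: datetime


def revoke_expired(
    grants: Iterable[Grant], now: datetime, revoke: Callable[[Grant], None]
) -> list[Grant]:
    """Revoke every grant whose expiry has passed; return the grants that are still active."""
    still_active = []
    for grant in grants:
        if grant.expires_at <= now:
            revoke(grant)  # hypothetical deprovisioning call
        else:
            still_active.append(grant)
    return still_active
```

This assumes `revoke` is idempotent, so re-running the sweep after a partial failure is safe.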

This is secure IAM by design rather than by process. It’s less complex, requires less manual work from everyone involved, and eliminates sustained privilege creep.

6. Full Audit Log

When audit time rolls around, we don’t need to go through the laborious process of performing a point-in-time access review for all of our internal tooling. We export our audit logs and show exactly when access was provisioned and deprovisioned, with timestamps. We can also send all of those logs to a SIEM such as Sumo Logic or Splunk.
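Exported audit logs also make it possible to check the policy mechanically. The hypothetical Python sketch below pairs grant and revoke events per user and resource. It flags any access that lasted longer than a limit, including grants that are still open past it. The event names and record shape are assumptions, not Lumos’s export format. A single `max_duration` also only fits time-based grants, so event-bound grants would need to be checked against the on-call shift instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AuditEvent:
    kind: str       # "granted" or "revoked" (hypothetical event names)
    user: str
    resource: str
    at: datetime


def find_violations(
    events: list[AuditEvent], max_duration: timedelta, now: datetime
) -> list[tuple[str, str]]:
    """Return (user, resource) pairs whose access outlived max_duration,
    including grants that are still open past the limit."""
    open_grants: dict[tuple[str, str], datetime] = {}
    violations = []
    for ev in sorted(events, key=lambda e: e.at):  # replay events in time order
        key = (ev.user, ev.resource)
        if ev.kind == "granted":
            open_grants[key] = ev.at
        elif ev.kind == "revoked" and key in open_grants:
            if ev.at - open_grants.pop(key) > max_duration:
                violations.append(key)  # revoked, but too late
    for key, granted_at in open_grants.items():
        if now - granted_at > max_duration:
            violations.append(key)  # never revoked and already past the limit
    return violations
```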

Summary

Big enterprises and DoD vendors trust Lumos with their data because of our approach to access control. By using our own tool, we’ve reduced the complexity and size of our access reviews while strengthening our security posture.

We’re able to make everyone happy. The on-call engineer is unblocked immediately thanks to pre-approvals. The security team knows that access will be granted and revoked automatically. The auditors can see everything recorded in our Activity Log. Sales is happy because even heavily regulated customers trust Lumos to manage their access.

Many other companies have built internal tools for this, such as Airbnb's Access Manager, GitHub's Entitlement App, and Twilio Segment's Access Service. We built Lumos to streamline what many teams have built internally, and to prove that a great security tool can both make you more productive and decrease risk.