Without DevOps at Help.com, I’d be out of a job. That’s what Pat Webster, Help.com’s Senior DevOps Engineer, explained to me midway through our conversation about what the DevOps team is looking for in its next Site Reliability Engineer (SRE).
Well, that’s pretty cool, I thought to myself. I’d better not piss him off.
As a Content Strategist, I have to admit my ignorance here: I had no idea what DevOps did besides hang out in the back office and do smart-people stuff all day. Apparently, the DevOps team quietly carries the welfare of the entire company on its shoulders.
It’s no big deal, really.
Not only are they responsible for building out Help.com’s infrastructure and prototypes to automate everything they can, but they’re also on call 24/7 so my coworkers and I can get a paycheck every two weeks.
I spoke to Pat because I wanted to understand what makes him and our CTO, Evan Lucas, tick. I also wanted to learn more about the SRE position. What exactly are we looking for? Why do we need one? Pat dug deep to give me the low-down.
RAQUEL GUARINO: What kind of values and principles do you look for in an SRE?
PAT WEBSTER: Insatiable curiosity is really important to me. That’s something that Evan and I talk about a lot. It’s hard to put your finger on exactly what it is, but if you get obsessed with what you do for a living and you’re never actually satisfied–like thinking something isn’t neat enough, or that it can always be cooler or faster–that’s the kind of person that’s really good at this job.
What about skills?
The biggest thing–it’s something we all agree on–is when something breaks, it’s not anybody’s fault. You’re gonna screw something up, it’s just gonna happen. But we fix it together. We’re not anybody that ever blames anybody else.
It’s the first job I’ve ever had where people actually act this way. Everybody talks about it, but it’s actually true here.
It feels like there could be a lot of pressure to get things right.
That’s kind of the thing, right? So if something happens and the company is offline, the fact that everybody else still gets paid is based on the fact that we can figure out what just happened. And that’s already enough stress as it is.
That totally makes sense. It’s important to stay cool under pressure.
Disaster recovery is what Evan called it.
What’s an example of a disaster?
There was an issue not long ago (it didn’t affect us) where Google’s Cloud Platform was giving the same IP address out to multiple servers, which means they don’t work anymore. So their fix was to turn it off and then turn it back on. You can’t do anything. It sucks. You just have to kind of wait it out.
Other ones are network outages that, again, you can’t do much about. The hardest part is troubleshooting them to realize it’s not something you can control, and being absolutely sure that that is the case. There are software changes all the time–like updates–and sometimes the old configuration files don’t work anymore, and you don’t know that until it breaks.
Okay, so knowing that, do you think an important part of your job is the investigative process?
Mhm. It’s about being able to troubleshoot incredibly complex things very quickly. And knowing how to google things accurately and quickly. That’s half the job.
So, what motivates you to do this job?
I really, really, really enjoy it. I have a cool job where I get to work on research experiments all day. And that’s not something most people get to do.
A lot of people have very mundane jobs where they go to work and know what they’re gonna do every day.
Every day is completely different. And at the same time, I can get a random harebrained idea and bounce it off Evan, and we’ll both get super excited; there’s nothing standing in the way and we just go for it. There are some responsibilities that you have to keep up with on a daily basis, but a lot of it is “Hey, how can we make this better?” and then trying really hard, really fast to do it.
So what benefit do you think an SRE will bring to the team?
We’ll do more cool shit.