From the lookout station, the Firewatcher scans the horizon. In the distance, they see something so subtle that most would miss it - a small, rising plume of smoke against the clouds. With their compass, map, and alidade, they quickly calculate the smoke's location and relay it to fire suppression teams.

At Gordian, we also have a Firewatcher. They're on constant lookout for potential problems among the hundreds of services we've integrated. Instead of maps and alidades, they use comprehensive monitoring and alerting tools that track every part of our system.

At any given second, our systems are processing critical transactions. From purchasing bags to securing the last seat on an urgent flight home, our partners depend on these systems to serve their customers. The Firewatcher is our first responder for issues that may impact our partners. If there's an incident, they coordinate the response. Every engineer at Gordian is trained to be on Firewatch, and each of us takes on week-long shifts.

Our systems hum along while handling tens of thousands of transactions. But every now and then, an airline we've integrated has an unexpected outage. When this happens, here's how we respond.

Like smoke against clouds, a few bad requests among thousands of successful ones can be easy to miss. However, our alerting systems are so sensitive that we often detect an airline outage before the airline itself has even noticed it.

Taking the role of Firewatcher, I will immediately begin pulling transaction traces using our custom-built logging and tracing tools. Analyzing these traces will allow me to quickly confirm the outage and determine its impact on our systems. I will call in an additional engineer to assist with communications while asking others to remain on standby. The supporting engineer will contact the airline's technical team directly and assist them in diagnosing the issue.

While this is happening, I will notify our partners about the outage. While an airline outage is a critical issue, our systems are built to handle this, and automatic failovers would already be taking place. This could include using alternate services or temporarily halting transactions with the problematic airline. We will quickly relay updates from the airline to our partners while keeping a close eye on our systems.

As the airline's services come back up, we will see our alerts close and metrics return to nominal levels. We then inform our partners that all services are back online.

Once the incident has closed, we do what we call a "Reflection" every Friday. Here, I would create a detailed writeup, including a timeline and analysis of the incident. Our team would evaluate how we handled the incident and determine how we could further optimize our response. We may even throw in hypothetical curveballs - what if there was a simultaneous outage in another service? - to test our ability to respond. We all document our ideas, which could include tuning alerts or building additional failovers, then assign them out to engineers to implement. The process of Reflection keeps us sharp and prepares us for situations regardless of whether we've seen them before.

Impact

Behind our systems and the hundreds of services we've integrated is the Firewatcher, keeping a close and careful eye. As a result, their actions have an immediate impact in ensuring our systems run smoothly. They're the first to spot a potential problem, the first to coordinate a response, and often the first point of contact for our partners.
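Gordian's logging and tracing tools are custom-built and not shown in this post, so purely as an illustration, here is a minimal Python sketch of that first Firewatch step: scanning recent transaction traces for an airline whose error rate has crossed an alert threshold. The record fields (`airline`, `status`) and the threshold value are hypothetical, not Gordian's actual schema.

```python
from collections import defaultdict

def error_rates(traces):
    """Compute the per-airline error rate from a batch of trace records.

    Each trace is a dict with hypothetical keys:
      "airline": carrier code for the integrated airline, e.g. "XY"
      "status":  HTTP-style status code of the transaction
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for trace in traces:
        totals[trace["airline"]] += 1
        if trace["status"] >= 500:  # count server-side failures as errors
            errors[trace["airline"]] += 1
    return {airline: errors[airline] / totals[airline] for airline in totals}

def suspected_outages(traces, threshold=0.05):
    """Return airlines whose error rate exceeds the alert threshold."""
    return sorted(a for a, rate in error_rates(traces).items() if rate > threshold)
```

In practice a check like this would run continuously over a sliding window of recent traces, which is how a handful of failed requests among thousands of successes can surface within seconds instead of waiting for partner reports.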