More Reasons for Companies to Dogfood Their Software
In my first article, I explained how software companies can reap the rewards of “dogfooding”: an approach where a company uses its own tools and software not only to solve the same problems its customers face, but also to test new technologies and surface potential issues or challenges.
At Sentry, dogfooding our products ensures we provide the best experience for our customers, just as they use us to provide the best experience for theirs. Whether that means using Breadcrumbs to fix a database lock bug, using Dashboards to untangle problems with Safari’s Content Security Policy (CSP) rules, or fixing ironic 404 errors on our documentation pages with Discover and Metric Alerts, dogfooding is essential to the way Sentry does business and improves our software.
Here are a few more examples of how Sentry benefited from this approach.
Unlocking Database Locks
Recently, we experienced two minor outages related to database lock contention: a situation where a process stops executing while it waits on another to release a shared resource they both depend on. Database locks are tricky to debug, since it’s difficult to replicate a concurrent system, with all of its unexpected side effects, in an often single-threaded test suite. We know from experience how painful a long-running transaction can be in Postgres.
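When contention strikes, Postgres itself can show who is waiting on whom. A diagnostic query along these lines (a sketch assuming Postgres 9.6+ for `pg_blocking_pids`; the column choices are illustrative) lists blocked sessions alongside the sessions blocking them:

```sql
-- List sessions currently blocked on a lock, together with the
-- sessions blocking them and how long the blocker's transaction
-- has been open.
SELECT
  blocked.pid                  AS blocked_pid,
  blocked.query                AS blocked_query,
  blocking.pid                 AS blocking_pid,
  blocking.query               AS blocking_query,
  now() - blocking.xact_start  AS blocking_xact_age
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
```

Running a query like this during an incident points directly at the long-running transaction holding the lock.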
The result was brief moments of unavailability. As a side effect of a long-running query, our production database began to grind to a halt, leaving our billing system in an inconsistent state. In other words, pending changes to a paid subscription couldn’t be applied without manual intervention. To identify the root cause, we had to look deeper, so we asked ourselves whether any other issues could be traced back to the organization affected by this odd billing state.
Thanks to our robust tagging and search infrastructure, we found the original exception that caused the inconsistent state. With the original exception identified, we used our Breadcrumbs tool to understand the sequence of SQL statements that led up to the lock contention.
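The core idea behind Breadcrumbs is simple: keep a bounded trail of recent events so that when an error occurs, the last steps leading up to it travel with the report. As a toy model (this is not Sentry’s SDK API, just an illustration of the concept):

```python
# Toy model of the breadcrumbs idea: a bounded trail of recent events
# that gets attached to an error report when something goes wrong.
from collections import deque


class BreadcrumbTrail:
    def __init__(self, max_crumbs=100):
        # deque with maxlen silently drops the oldest entries.
        self.crumbs = deque(maxlen=max_crumbs)

    def record(self, category, message):
        self.crumbs.append({"category": category, "message": message})

    def attach_to(self, error):
        # Snapshot the trail alongside the error, as an SDK would.
        return {"error": str(error), "breadcrumbs": list(self.crumbs)}


trail = BreadcrumbTrail(max_crumbs=3)
for stmt in ["BEGIN", "SELECT ... FOR UPDATE", "UPDATE subscription ...", "COMMIT"]:
    trail.record("query", stmt)

report = trail.attach_to(RuntimeError("lock timeout"))
# Only the 3 most recent statements survive the bounded trail.
```

In the real SDK, SQL queries are recorded automatically as breadcrumbs, which is what let us reconstruct the statement sequence here.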
Thanks to these Breadcrumbs, we realized that while most of our code paths use our default database connection, our billing code relies on a special dedicated database connection. The dedicated connection gives us additional flexibility when making sensitive mutations in transactions isolated from those used throughout the rest of our codebase. Once we recognized that the lock was being held far longer than it should have been (minutes rather than seconds), we realized it could have a cascading impact and ultimately cause the outages we had experienced twice before. The fix was to use the correct database connection in this critical code path.
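A lightweight guardrail can catch this class of bug before it cascades. As a minimal sketch (the names and threshold are illustrative, not Sentry’s billing code), a context manager can flag any critical section held longer than expected:

```python
# Hypothetical guardrail: report when a critical section (e.g. a
# transaction holding a lock) stays open longer than a budget allows.
import time
from contextlib import contextmanager


@contextmanager
def bounded_transaction(max_seconds, on_slow):
    start = time.monotonic()
    try:
        yield
    finally:
        held = time.monotonic() - start
        if held > max_seconds:
            # In production this could log a warning or capture an event.
            on_slow(held)


slow_calls = []
with bounded_transaction(0.01, slow_calls.append):
    time.sleep(0.05)  # simulate a mutation that holds its lock too long
# slow_calls now records that the budget was exceeded.
```

Wrapping sensitive mutations this way turns a silent minutes-long lock into an explicit, alertable signal.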
Securing Content Policies
Put simply, a CSP lets a browser know which content sources are to be trusted — and which aren’t. And when there is a CSP violation, browsers can submit the error to a report-uri. With Sentry, these violation reports are integrated into the application’s monitoring dashboards. During a recent penetration test, we found that some of the CDN domains in our allow list were hosting potentially dangerous scripts. One of our recommendations from this audit was to refine and improve our CSP rules.
CSPs have two modes: the first enforces the rules, actively blocking resource loading and execution, while the second only collects the errors that would occur if the rules were enforced. This report-only mode is enabled by defining the CSP rules in a Content-Security-Policy-Report-Only header. When we combined the report-only mode with Discover queries, we could visualize all the errors it produced, view the impact of fixes in real time, and see which rules were broken without disrupting customers.
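Concretely, rolling out rules in report-only mode means serving the same directives under a different header name. A minimal sketch (the directives, domain, and report endpoint are placeholders, not our actual policy):

```python
# Build a report-only CSP header: violations are reported to the
# report-uri endpoint but nothing is blocked for users.
REPORT_URI = "https://example.ingest.sentry.example/api/1/security/"  # placeholder


def csp_report_only_header(report_uri):
    directives = [
        "default-src 'self'",
        "script-src 'self' https://cdn.example.com",  # illustrative allow list
        f"report-uri {report_uri}",
    ]
    # Swapping the header name to Content-Security-Policy would
    # enforce these same rules instead of only reporting them.
    return ("Content-Security-Policy-Report-Only", "; ".join(directives))


name, value = csp_report_only_header(REPORT_URI)
```

Because only the header name differs between the two modes, promoting a vetted rule set from reporting to enforcement is a one-line change.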
While the report-only mode worked well, we encountered some hiccups in Safari: we were getting violation reports for sources that should have been allowed. To verify that Safari bugs caused these errors, we enforced only the new rules in our staging environment and tested with Safari. Using Dashboards, we got a full view of Safari’s impact on our CSP errors.
A Documentation Roadmap
Documentation is a roadmap. Just as maps are updated to reflect changing infrastructure, we update our documentation to give users the best way to purposefully use Sentry.
As we transitioned to a new documentation structure, all of our existing links broke, leaving a sea of 404s. For most companies, a 404 on their website is a minor embarrassment. For an error monitoring company, a 404’d page is a special kind of irony.
Since we had to reproduce each page from scratch, we couldn’t just patch in a bunch of redirects and call it a day. First, we had to understand the scope of the problem, so we used Discover, our insights tool, to identify where content was broken and to prioritize it by user and request count. Then, with Metric Alerts, we were notified whenever a future deployment caused a spike in 404s.
To stop future 404s, we built a homegrown linkchecker that evaluates and identifies broken links and anchors, and we required that all links pass the linkchecker before merging. Just like some drivers refuse to consult maps, we understand that some developers don’t like reading documentation. And while there are other ways to learn our products’ best practices, there’s no better resource for getting code moving in the right direction than documentation. We certainly proved that here, using our platform to help us create documentation that helps developers use our platform.
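The idea behind such a linkchecker is straightforward. As a hedged sketch (our real tool crawls rendered output; this minimal model works over an in-memory set of pages): collect every page’s links and anchor ids, then verify each internal link targets an existing page and each `#fragment` targets an id on that page.

```python
# Minimal model of a docs linkchecker: verify internal links and anchors
# across a set of rendered pages. Stdlib only.
from html.parser import HTMLParser
from urllib.parse import urldefrag


class PageScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []       # href values found on the page
        self.anchors = set()  # element ids that can be linked to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        if "id" in attrs:
            self.anchors.add(attrs["id"])


def check_links(pages):
    """pages: {path: html}. Returns a list of (page, broken_href)."""
    scanned = {}
    for path, html in pages.items():
        scanner = PageScanner()
        scanner.feed(html)
        scanned[path] = scanner

    broken = []
    for path, scanner in scanned.items():
        for href in scanner.links:
            target, frag = urldefrag(href)
            target = target or path  # a bare "#frag" stays on this page
            if target.startswith(("http://", "https://")):
                continue  # external links are out of scope in this sketch
            if target not in scanned:
                broken.append((path, href))          # missing page
            elif frag and frag not in scanned[target].anchors:
                broken.append((path, href))          # missing anchor
    return broken


pages = {
    "/docs/": '<a href="/docs/setup/#install">setup</a> <a href="/docs/gone/">gone</a>',
    "/docs/setup/": '<h2 id="install">Install</h2>',
}
broken = check_links(pages)  # only the link to the missing page is broken
```

Gating merges on a check like this turns broken links from a production alert into a failed CI run.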