Dear Blackthorn customers and partners,
We are now in our sixth year, with 500+ customers, two core products, a stack of customer use cases, very high spikes in transaction rates (API calls), and more complex architectures than ever before. Our Events app in particular has gone down (the customer-facing view became inaccessible) a number of times in the last year, sometimes for hours at a time.
All software companies have growing pains and outages, including Slack, Salesforce, Twitter, LinkedIn, and even AWS! Outages will happen, but it’s our job to do every single thing in our power to prevent them.
On that note, we’d like to share some details about recent incidents and what we’re doing about them:
- In October 2021, we pushed a major enhancement to the Events webapp (“New Checkout”) that inadvertently caused headaches for some customers. We had to roll some items back and create hotfixes for the major bugs customers were experiencing. This was a systemic issue with many failure points. From the start, the enhancement lacked full, granular requirements, because we had no coordinated Product team to write the epic and the associated user stories. We now have two product people and are hiring two more immediately. Additionally, because of self-imposed release timelines, we pushed the enhancement when we should have kept testing. We have since removed scheduled release dates/deadlines and are performing more extensive testing in our release cycles so this never happens again: internal bug-bash testing, customer pre-general-release testing, and feature flagging big new features and enhancements before they go to all customers.
- Going forward, all new potentially breaking functionality will be released behind feature flags. Only select customer organizations will get access at first, and we’ll do staggered releases to everyone else, much like Tesla is doing with its FSD beta. This lets us catch bugs reported by customers who opt in to the early experience, as well as by our own team dogfooding the product (see the rollout sketch after this list).
- Yes, we have acquired some companies recently, but we put our current customers first. What they want is stable products, especially because their businesses rely on the apps they’ve purchased. So in the background, we’ve been scaling heavily. Our Events team, across development, product, and QA, is now roughly 20 people, split between infrastructure improvements, bug fixing, and small stability-focused enhancements. That’s bigger than most Salesforce engineering companies in their entirety. There’s a singular focus on stability, to the extent that we’ve halted work on any new feature that takes more than one day to build. That’s all this team is working on.
- To decide whether a customer can run a live event, we have to check that their license is active, which means querying the environment where that data lives. Today, that query goes to our source Salesforce org. This has caused API limit issues (which we’ve raised with Salesforce; this was one outage) as well as maximum concurrent API request issues (a limit that cannot be raised). We’re now replicating this license data to our own MongoDB data store outside of Salesforce (license number, license status, org ID, no PII), which eliminates the concurrent API request bottleneck (a sketch of the lookup follows this list).
- Per-Salesforce-org queries are hitting rate limits for high-volume events. We never anticipated a single org sending out 500,000 event invitations, all at once, for a single event (yes, this happened!). It caused sporadic access issues across the Events platform and broke access to that org’s events entirely. This led us to a new architecture, now underway, in which event-, event item-, and session-level capacities are managed at the Heroku/MongoDB level and synced back to Salesforce in batches (see the sync sketch after this list). Salesforce was never designed as a high-volume database, so we’ve had to make significant changes to account for this.
- We upgraded our core Events and Connect 360 Heroku apps to Heroku Enterprise. This lets us set granular per-user security options (the absence of which previously contributed to an outage when a configuration variable was changed incorrectly) and implement a new process for managing Heroku configuration variables based on those roles. The plan also lets us increase (and remove) limits on auto-scaling dynos, which will improve performance at scale.
- Event caching has failed to update a number of times. We previously queried all events (and all their related data) in an org as one “blob” whenever someone did a hard refresh. This caused many issues (not full outages, but terrible performance). We’ve now implemented a granular cache, so only the data for a single Event is invalidated, while the Update button in Salesforce still refreshes all events and associated data (a sketch of the invalidation approach follows this list).
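For readers who want a bit more technical detail on the items above, here are a few rough sketches. First, the staggered, feature-flagged rollout: a minimal illustration of gating a feature to opted-in orgs first and then to a growing percentage of everyone else. The flag names, org IDs, and percentages are hypothetical examples, not our production configuration.

```typescript
// Illustrative only: a minimal per-org feature flag check with staggered rollout.
type FeatureFlag = "new_checkout" | "granular_cache";

// Orgs that have opted in to early access, keyed by flag (hypothetical IDs).
const earlyAccessOrgs: Record<FeatureFlag, Set<string>> = {
  new_checkout: new Set(["00D_EXAMPLE_ORG_1", "00D_EXAMPLE_ORG_2"]),
  granular_cache: new Set(["00D_EXAMPLE_ORG_1"]),
};

// Percentage of all orgs the flag has been rolled out to so far.
const rolloutPercent: Record<FeatureFlag, number> = {
  new_checkout: 10,
  granular_cache: 0,
};

// Deterministic hash so a given org always lands in the same rollout bucket.
function orgBucket(orgId: string): number {
  let hash = 0;
  for (const ch of orgId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % 100;
}

export function isEnabled(flag: FeatureFlag, orgId: string): boolean {
  if (earlyAccessOrgs[flag].has(orgId)) return true; // opted-in orgs always get it
  return orgBucket(orgId) < rolloutPercent[flag];    // then the staggered percentage
}
```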
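Second, the license check against our own replicated data store instead of the source Salesforce org. This is a simplified sketch; the database, collection, and field names are assumptions for illustration, and only non-PII license fields are involved.

```typescript
// Illustrative only: checking license status from a replicated MongoDB record
// instead of calling the source Salesforce org on every request.
import { MongoClient } from "mongodb";

interface LicenseRecord {
  orgId: string;         // Salesforce org ID
  licenseNumber: string; // replicated from the source org
  licenseStatus: "active" | "suspended" | "expired";
  // Note: no PII is replicated.
}

const client = new MongoClient(process.env.MONGODB_URI ?? "mongodb://localhost:27017");

// Answers "can this org host a live event?" from the replica,
// avoiding a Salesforce API call on the hot path.
export async function canHostLiveEvent(orgId: string): Promise<boolean> {
  await client.connect();
  const licenses = client.db("blackthorn").collection<LicenseRecord>("licenses");
  const record = await licenses.findOne({ orgId });
  // Fail closed if the replica has no record for this org yet.
  return record?.licenseStatus === "active";
}
```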
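Third, managing capacity at the MongoDB level and syncing counts back to Salesforce in batches. This sketch shows the general shape of the idea; the collection fields and the pushCountsToSalesforce() helper are placeholders, not our actual integration.

```typescript
// Illustrative only: seat capacity tracked in MongoDB, batch-synced to Salesforce.
import { MongoClient } from "mongodb";

const client = new MongoClient(process.env.MONGODB_URI ?? "mongodb://localhost:27017");
const sessions = client.db("blackthorn").collection("session_capacity");

// Atomically claim one seat; the filter guarantees we never oversell,
// and Salesforce is not touched on the registration hot path.
export async function claimSeat(sessionId: string): Promise<boolean> {
  const result = await sessions.updateOne(
    { sessionId, seatsRemaining: { $gt: 0 } },
    { $inc: { seatsRemaining: -1, pendingSync: 1 } }
  );
  return result.modifiedCount === 1;
}

// Run on a schedule: push accumulated registration counts back to Salesforce
// in one batch per session, then clear the pending counters.
export async function syncToSalesforce(): Promise<void> {
  const dirty = await sessions.find({ pendingSync: { $gt: 0 } }).toArray();
  for (const doc of dirty) {
    await pushCountsToSalesforce(doc.sessionId, doc.pendingSync); // hypothetical wrapper
    await sessions.updateOne({ _id: doc._id }, { $inc: { pendingSync: -doc.pendingSync } });
  }
}

// Placeholder for the actual Salesforce update (e.g. a bulk or batched API call).
async function pushCountsToSalesforce(sessionId: string, count: number): Promise<void> {
  // Intentionally left abstract in this sketch.
}
```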
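Finally, the granular event cache: invalidating a single event’s cached data instead of the whole org “blob.” The data shape and loader function here are hypothetical.

```typescript
// Illustrative only: per-event caching, where a change to one event evicts
// only that event's entry rather than the entire org's cached data.
interface EventData {
  eventId: string;
  name: string;
  sessions: unknown[];
}

const cache = new Map<string, EventData>();

export async function getEvent(eventId: string): Promise<EventData> {
  const cached = cache.get(eventId);
  if (cached) return cached;
  const fresh = await fetchEventFromSalesforce(eventId); // hypothetical per-event loader
  cache.set(eventId, fresh);
  return fresh;
}

// Called when a single event changes: only that entry is evicted.
export function invalidateEvent(eventId: string): void {
  cache.delete(eventId);
}

// Called from the "Update" button in Salesforce: refresh everything.
export function invalidateAllEvents(): void {
  cache.clear();
}

async function fetchEventFromSalesforce(eventId: string): Promise<EventData> {
  // Placeholder for the real per-event query.
  return { eventId, name: "Example Event", sessions: [] };
}
```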
As a product-led organization, we personally apologize for the outages. We know this doesn’t change what’s transpired, but you have our word and commitment: we will focus only on stability until we are rock solid, and only then will our core Engineering team resume building enhancements and new features.
Now that we’ve shown you our plan, we look forward to showing you our work in a follow-up post in the near future. We’re a data-driven company, and we want you to see the results for yourselves.
Thank you for giving us the privilege to serve you and your customers.
Thank you!
Blackthorn Team