Over the last few months, you may have seen periods of intermittent service disruption. If I were in your shoes, I’d be frustrated, uncertain, and asking: why Blackthorn? Thank you for your continued trust. I’m writing to describe what has occurred, what we are doing immediately, and what will be in place by the end of October or sooner.
While two of these performance issues were due to quick-to-address bugs, the rest were caused by high traffic volume. With each high-volume issue we’ve run into, we’ve made immediate improvements to scalability.
On March 11, we were receiving 12 registrations per second, which was more traffic than our platform could consistently handle. Here is what we have done so far and what we are doing right now:
- Four additional US instances on Heroku for the worker (our distribution mechanism) to distribute the traffic. We currently have one primary US instance and are setting up a second, aside from our EMEA and APAC instances.
- Creating the second instance on Thursday distributed the traffic properly, but it caused checkouts to fail because of how we use our Redis cache. The Redis cache stores client sessions, and sessions were crossing between instances. This is now fixed.
- We’ve begun setting up the Cloudflare load balancer across all instances in addition to our worker functionality.
- We’re creating sticky sessions so that a user hits the same Heroku instance throughout the whole session.
- Removing unpublished events from the cache to reduce the query size.
- We’re now on the most performant single-instance options Heroku offers ($100K+/yr) and are increasing throughput further via the worker distribution mentioned above.
- Cloudflare: We’ve implemented their Argo feature set for further load handling.
- Cloudflare: Vanity domains now run under their own dedicated handling. A few times, customers have run penetration or end-to-end load tests against our APIs, generating 50+ requests per second and clogging traffic. This traffic is now modulated.
- Cloudinary: Images now run on a performant CDN (content delivery network) to make image load times fast. Media (soon video too) is served from the nearest available server to your location.
- We have a new development pipeline that gives us a better understanding of errors, helps us identify more to automate, and makes regression testing more efficient.
- We hired a new dedicated person for stress and smoke testing, which will allow us to run tests against specific customer scenarios.
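To illustrate the sticky-session item above: one simple way to keep a user on the same instance is to derive the instance from the session ID deterministically. The sketch below shows the idea only, and is not how Cloudflare or Heroku implement session affinity. The instance names and the `pick_instance` helper are hypothetical.

```python
import hashlib
from typing import Sequence

# Hypothetical instance names; the real deployment's names are not given here.
INSTANCES = ("us-1", "us-2")


def pick_instance(session_id: str, instances: Sequence[str] = INSTANCES) -> str:
    """Map a session ID to one instance, always the same one for that ID."""
    # A stable hash (unlike Python's built-in hash()) gives the same answer
    # on every process and every restart.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]
```

Because the same session ID always maps to the same instance, that session's data never needs to cross between instances. One trade-off of this simple modulo scheme is that adding or removing an instance remaps many sessions.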
Hiring & Process
- Offers extended to three platform developers as of Friday. We will be hiring four.
- Many more platform developers in the hiring pipeline are taking our coding challenge (which we pay candidates $200 to complete).
- Our new VP of Engineering, Nick Snyder, starts full-time in two weeks and is already working part-time. Nick comes from a background in both Salesforce and full-stack technical development, and has previously migrated from Heroku to AWS. AWS will allow us to have a series of redundant instances, true horizontal scaling (high-traffic scalability), and deeper experience in scaling to a platform with 500+ concurrent sessions (equivalent to 1,000,000+ invite emails going out at once).
By end of October or sooner
- Removing the client-side `clear cache` method so that we rely on the Salesforce-side cache invalidations (the Publish and Update buttons). When a customer does a hard refresh today, it force-invalidates the cache for the instance, which creates a large query.
- Decoupling registration from Salesforce. The registration payload will be held in our database for up to ten seconds or until we have five registrations to bulk submit to Salesforce, whichever comes first. This will reduce the number of queries to Salesforce and the load on our data connector.
- Caching AttendeeLink data in our MongoDB cache. This will take this query load off Salesforce (there’s an unmovable limit on inbound Salesforce API calls) and keep us clear of Salesforce API limits. AttendeeLink URLs (invites or post-registration links) are queried in real time, so if you send out, say, 100,000 invites and your registrants open, say, 10 links concurrently (everyone clicks the button at once), the queries pile up. With caching, there is zero strain or latency on Salesforce because the data is cached server-side.
Names and emails will be held in an encrypted cache, protected both in transit and at rest using TLS 1.2 and 256-bit encryption. We’re already through our SOC 2 audit and will be going through third-party audits for HIPAA approval in conjunction with this move, so there will be no concerns with cached data. If your org maintains other data on your customers, it will not be cached; we cache only the data used to prefill registration forms and the data that will be used in our forthcoming Live Events add-on (email, name, title, company). The cache will come with a right-to-be-forgotten mechanism to comply with GDPR.
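The registration batching rule described above (submit after five registrations or ten seconds, whichever comes first) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our production code: the `RegistrationBuffer` class, its `flush` callback, and the periodic `tick` call are all hypothetical names, and a real version would persist pending registrations in the database instead of in memory.

```python
import time
from typing import Callable, Dict, List, Optional

Registration = Dict[str, str]


class RegistrationBuffer:
    """Collects registrations and hands them to `flush` in batches."""

    def __init__(
        self,
        flush: Callable[[List[Registration]], None],
        max_size: int = 5,               # batch size from the plan above
        max_wait_seconds: float = 10.0,  # time limit from the plan above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_size = max_size
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._pending: List[Registration] = []
        self._oldest_at: Optional[float] = None

    def add(self, registration: Registration) -> None:
        # Start the timer when the first registration of a batch arrives.
        if not self._pending:
            self._oldest_at = self._clock()
        self._pending.append(registration)
        if len(self._pending) >= self._max_size:
            self.flush_now()

    def tick(self) -> None:
        """Call periodically (for example once a second) to enforce the time limit."""
        if self._pending and self._oldest_at is not None:
            if self._clock() - self._oldest_at >= self._max_wait:
                self.flush_now()

    def flush_now(self) -> None:
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._oldest_at = None
        self._flush(batch)  # e.g. one bulk submission to Salesforce
```

With this shape, five simultaneous registrations result in one bulk call instead of five, and a lone registration waits at most ten seconds before it is submitted.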
- Moving nearly all Heroku assets to AWS. AWS has significantly greater capabilities for scaling infrastructure than Heroku, specifically around horizontal scalability, which is the real-time spinning up of instances based on traffic. Nearly all large companies on the planet run on AWS (the largest), Microsoft Azure, or Google Cloud.
This migration will happen with a subset of beta customers at first, through the use of per-org flagging. We’ll begin with our own testing orgs, then move to small batches of customers, with an immediate rollback option in case of any hiccups. This approach heavily mitigates risk. Customer subsets will be migrated during their off-hours on an inverted time-zone schedule (US customers during APAC/EMEA hours and vice versa; our team is global).
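As a rough picture of per-org flagging with rollback, the sketch below routes each org to one of two backends based on a flag. It is an assumption-laden illustration: the `MigrationFlags` class, the org IDs, and the `"heroku"` / `"aws"` labels are hypothetical, and a real system would store the flags somewhere durable.

```python
from typing import Set


class MigrationFlags:
    """Tracks which orgs are served from the new stack (hypothetical helper)."""

    def __init__(self) -> None:
        self._migrated: Set[str] = set()

    def enable(self, org_id: str) -> None:
        # Move one org onto the new stack.
        self._migrated.add(org_id)

    def rollback(self, org_id: str) -> None:
        # Immediate switch back; safe to call even if the org was never migrated.
        self._migrated.discard(org_id)

    def backend_for(self, org_id: str) -> str:
        return "aws" if org_id in self._migrated else "heroku"


flags = MigrationFlags()
flags.enable("internal-test-org")  # start with our own testing orgs
```

Because the decision is made per org on every request, rolling one org back does not affect any other org.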
Our team currently has over 50 engineers working across our existing stack (35 people, covering all of the above) and new technologies (15 people). We are currently working on our new Event wizard, which is being built from scratch by a dedicated team; our Live Events module (enormous, combining virtual and in-person event concepts); and Commerce (think Shopify for Salesforce), focused on letting you self-serve your own store for both digital (education) and physical goods. All of these will be out by late summer.
I realize the above information does not fix the issues this second, but over the coming weeks, you’ll notice immense improvements in stability. Please reply to me with any and all questions and feedback, and I’ll loop in our relevant team members to assist.
We aim to be your trusted partner in all things Events, Payments, Messaging, and Commerce for Salesforce, and we appreciate your trust, confidence, and patience in working with us.
firstname.lastname@example.org / 917.509.5971