I agree with everyone that it’s problematic that there was not a service status update earlier. For me, the issue has been going on most of the day, and less techy people who check the service status and see nothing could have spent hours thinking it was a problem at their end and trying to solve it. Other providers would have reported it was a problem at the provider’s end as soon as it became clear that was the case, and there is no explanation as to why this did not happen with Runbox, now for a second time. I like Runbox and everything it stands for, but this response is disillusioning in terms of respect for customers. The genuine issue itself is no problem, and I am sure there is one; it is the feedback (or lack of it) that is problematic.
Thank you @Geir for the explanations. If your team was aware of any issues at 1100 CEST, that was when a post on your status page should have been made. Please do that going forward.
Regarding successful IMAP logins, I would not take that as an indication of a successful connection. My machines are able to, occasionally, log in via IMAP but then things grind to a halt. Downloading the current list of messages takes at least 20 minutes and/or times out. I am not getting any message bodies, merely subject lines, at best. No attachments are able to come through. Perhaps the volume of actual, meaningful traffic from the Dovecot server might show up as being very low compared to your usual traffic averaged over an hour.
I wish you success mitigating the attack. I hope that you can put in place some economically-viable protections.
Thank you, @Geir for your post.
Please consider posting updates much earlier, it would be more reassuring for us customers. It doesn’t take long to post something like, “there’s a problem, we’re working on it”.
And it’s probably a high-stress situation if there’s a DDoS attack, so you might consider making some sort of checklist of what to do when these things happen, and on that checklist there’s also a point “inform our customers early” or something like that.
Good luck, DDoS attacks are difficult to mitigate and I’m sure you’re doing the best you can!
Also down all day in Southeastern USA.
I agree, @lukemb64 that’s a good point.
Thank you again for your comments and your support.
We absolutely agree that we should be more responsive with our customers here and in other channels during incidents.
Although these incidents always seem to hit at the worst possible time for our team, this is no excuse and we should and will do better.
Please explain this:
Although these incidents always seem to hit at the worst possible time for our team
I first noticed this outage happening Tuesday at Noon, Norway time. That is the middle of your work day, isn’t it? No updates came for several hours.
There have been several outages over the years (yes, YEARS), but very little improvement in the response time. We’ve heard endless excuses for why things are done the way they are, such as “Before we say anything, we first investigate and confirm the problem”… but that action in itself should be posted immediately as an “Investigating” update. Then we know.
I say this with the utmost respect … you’re a dedicated, hardworking and passionate team. That’s why I’ve remained a customer for many years, despite my clients and partners complaining endlessly about this issue or that issue. I’ve remained loyal because I prefer to give my money to a small, dedicated team rather than a large, faceless corporation. But my patience has worn very thin.
Please — for the love of customer service — fix your response strategy before the next outage. You have endless examples from other companies that do it properly.
- Automated monitoring sends your team a notice within minutes of a problem.
- Immediately post an “[Investigating]” update to the Status blog, this Community site, and Twitter. Each venue reaches different users. It doesn’t matter if the issue affects 10 users or 1000 users. Each customer should matter, and a proper, transparent investigation matters. You were already notified of an issue by the automated monitoring (Step 1), so that notification should be trusted.
It may sound like a lot of work to post the update to several places, so don’t. Just post the update to one place, and link there from the other sites. Easy.
I’ve heard responses like “NodePing isn’t reliable so it can’t be trusted”, so maybe spend some money on another solution that works better? StatusPage is one example.
In the responses over the years, there have been statements like “Only affecting a small percentage of users”, but that assumes each user has just one account. At the moment, I have 5 IMAP accounts set up on my laptop, and 3 of the accounts were affected. I, as a user, was affected, even if not all of my accounts were. So maybe reframe the response as being about “accounts” instead of “users”?
Lastly, and I’ve suggested this in the past… post an RFO (Reason for Outage) after the outage has been resolved. This may seem pointless since the outage is now resolved, but some of us want to know more details about what caused an outage and what is being done to prepare for future similar outages. All too often, nothing is posted.
Promises are meaningless if improvements aren’t actually seen over time. Thank you for your continued work towards a better customer experience. It’s when things are not working that it matters most.
Geir, just one sentence on the service status page would be fine to start with, at the point the problem is noted, e.g. ‘there is a problem with email and we are working on it’. Why was this not possible? (A genuine, friendly question.) Thanks for all the work on sorting the problem out!
We acknowledge and appreciate all your comments here.
This is just a quick update to let you know we are conducting a thorough review of this incident and our response procedures, and are producing a detailed improvement plan.
We will have more to post soon.
I know that Runbox is investing most of your time and energy into your Runbox 7 webmail client, which I appreciate since it’s replacing a broken and aging webmail client. I occasionally use webmail for accounts that I don’t have set up using IMAP.
But here’s why I use IMAP, which I hope helps to explain why outages are so impactful — because I have multiple accounts, and Runbox webmail only supports signing into a single account at a time.
I’ve posted about the slow loading of R7’s webmail interface due to it being designed around “heavy caching” of resources. If I need to sign into more than one account, then I use private windows, which means each new private window needs to reload the UI each and every time, which is very slow. The response I got back (from a certain someone on Twitter) was that I wasn’t using the webmail as it’s designed to be used. Ummm, hello?.. so it’s designed around the assumption that a single user has only a single account? I had originally given that feedback in hopes that the Runbox team would look into why the UI loads so slowly and fix that. (SVG graphics are a big contributing factor).
So bottom line is that IMAP allows us to have simultaneous access to multiple accounts at once, across multiple devices. R7 webmail does not offer that, at least not efficiently, since private windows conflict with the caching strategy. Improving the asset loading may solve that, though.
Looking forward to your response on outage handling going forward. Thank you.
Thank you @Geir for the work you have done and what you are still planning to do. IMAP is what we use 100% of the time across the four accounts we have. Webmail is only of interest when IMAP breaks or for Runbox-specific admin purposes. Webmail in general is a nuisance for us: we can no longer search across all mailboxes (and across the rest of the Mac using Spotlight) and, above all, we lose all the deep macOS, iOS and iPadOS email integrations which we use several times per hour every day. Please prioritise your work on an excellent, reliable, and well-protected IMAP service. The webmail you have is more than good enough for us, even if it is still work in progress, because it is simply unneeded when IMAP works. However, I realise you have other use case scenarios.
As a single user I tried the webmail client for a long time, but in the end it just didn’t work for me. So I would like to add support to the previous statement that the quality of IMAP service is fundamental to me using Runbox.
I posted that response on Twitter because that’s the case. No Runbox interface ever has been intended to access more than one account at a time because Runbox was designed on the basis that you can direct all your email to one account and organise it there.
Now, times have changed and I can see the utility of allowing a single browser login session to access more than one of your accounts, especially if you log in to the main account, but my reply was still a matter of fact. Making the interface access more than one account is a different issue, so I suggest you submit it as an enhancement on GitHub and then we can look into it.
That’s also how I use Runbox. So normally I don’t use the webmail client at all.
The incident on March 29 affecting our IMAP/POP service caused interruptions for a large number of Runbox accounts between approximately 11 and 23 CEST. Extensive investigations made during and after the incident revealed that the interruptions were caused by brute-force login attempts against the Runbox Dovecot IMAP/POP servers from a number of IP addresses. This conclusion was difficult to reach because the cause was camouflaged by consequent errors on the Dovecot proxy servers and authentication-related issues.
A thorough review of our records and server logs indicates that the login attempts on our Dovecot proxy servers gradually increased from a normal level at 11 CEST by approximately:
135% between 12 and 13 CEST,
209% between 13 and 14 CEST,
308% between 14 and 15 CEST.
At this time our system administration team at Copyleft Solutions was alerted to IMAP/POP connection issues, ranging from slow connections to no connections at all. They proceeded to reboot the Dovecot servers and then the Dovecot proxy servers, which led to consequential authentication and proxying issues for a short while. Subsequently an increasing number of successful connections was recorded, but the login problems largely persisted.
Further investigations revealed what appeared to be a targeted brute-force attack on our IMAP service that effectively denied legitimate connections from a significant number of users. Once a brute-force attack was ascertained, further analysis of server logs found that a significant number of IP addresses had reached our Dovecot services despite our automatic, authentication-based brute-force prevention systems.
Our system administration team subsequently blocked the most frequent IP addresses by adding them to the central firewall. This appeared to alleviate the situation as the login volume dropped to 246% between 17 and 18 CEST. There were, however, still connection timeouts to the Dovecot servers, so Runbox staff was alerted and a cooperative effort was initiated. Because staff routinely operate from different geographic locations to cover all time zones, and some at this time had varying cell phone coverage, this further delayed our response.
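The blocking step above amounts to finding the most frequent offending source IPs in the authentication logs and feeding them to the firewall. As a minimal sketch of that analysis (the log line format and threshold here are illustrative assumptions, not the actual Dovecot log format or Runbox's tooling):

```python
from collections import Counter

def top_offenders(log_lines, threshold=100):
    """Count failed login attempts per source IP and return the IPs
    at or above the threshold, most frequent first.
    The "auth failed rip=<ip>" format is a hypothetical stand-in."""
    counts = Counter()
    for line in log_lines:
        if "auth failed" in line and "rip=" in line:
            ip = line.split("rip=")[1].split()[0]
            counts[ip] += 1
    return [ip for ip, n in counts.most_common() if n >= threshold]

# The resulting list could then be turned into firewall entries,
# e.g. one "iptables -A INPUT -s <ip> -j DROP" rule (or an ipset
# entry) per address.
```

In practice a tool like fail2ban automates this loop, but during an attack that evades per-account rate limits, a one-off aggregation like this across all accounts is often how the most aggressive sources are identified.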
Investigations and attack mitigation continued and additional blocking of IP addresses was attempted in the firewall on the Runbox gateway servers. However, the attack volume thereafter increased to approximately 327% between 18 and 19 CEST and was sustained at approximately the same level for the next few hours.
The mail authentication service on the Dovecot proxy servers continued to experience issues, and investigations continued to determine whether these problems were caused by brute-force attacks themselves or by cascading issues in the infrastructure.
By 21 CEST the attack volume started decreasing to approximately 292% and subsequently to:
269% between 22 and 23 CEST,
200% between 23 and 24 CEST,
74% between 00 and 01 CEST the next day.
Around midnight Dovecot services were once again restarted and the situation then normalized, after which the login volume returned to a level similar to that before the incident.
The incident was an isolated, though serious, event which calls for improvements in how we handle and respond to such situations. Our company is sound, growing, and thriving, which means we are in a position to make the necessary improvements.
Based on our experiences from this incident we are initiating adjustments and improvements to our service monitoring and alert instrumentation.
Our team has been determined to increase service monitoring to a level sufficient to gain a complete and comprehensive overview of our services’ availability and reliability, putting us in a position where scaling and improvements can be implemented ahead of time. As our infrastructure has grown in size and complexity, we currently use a combination of Munin, Nagios, and Check_MK for this purpose.
However, especially in periods of high traffic, these can in combination produce an excessive number of daily alerts, which is challenging for our staff to process effectively. We are now working to converge on just one or two systems, resulting in a smaller, consolidated set of alerts to manage.
It can also be challenging to distinguish between real and false alarms, and we have therefore initiated a review of our system’s monitoring thresholds together with our system administration team.
We additionally utilize the external NodePing service availability monitoring, which checks whether registered servers are responding to requests from various geographic locations. If a Runbox service or server fails to respond within a given interval and number of rechecks, NodePing sends alerts via email, to our IRC server, and as text (SMS) messages.
These checks have also become a double-edged sword by being quite sensitive, in some cases flagging short timeouts caused by servers briefly being in a wait state. These may not be noticeable to end users, and such alerts have in some instances drowned out alerts about actual outages.
In order to retain the information provided by these sensitive checks, which help us identify and resolve intermittent IMAP timeout issues, we have added a new set of NodePing alerts with a higher interval and recheck number intended to alert Runbox staff via SMS about actual outages. Depending on the results of these changes, we are additionally considering apps such as Pushover to ensure that our staff becomes aware of problems quickly.
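The interval-and-recheck approach described above boils down to a simple rule: page a human only when several consecutive checks have failed, so brief wait-state timeouts never escalate. A minimal sketch of that rule (the boolean list shape is illustrative, not NodePing's actual API):

```python
def should_page(check_results, rechecks=3):
    """Return True only when the most recent `rechecks` consecutive
    check results are all failures. `check_results` is a list of
    booleans (True = check passed), newest last."""
    if len(check_results) < rechecks:
        return False
    # Escalate only if no check in the recheck window succeeded.
    return not any(check_results[-rechecks:])
```

With `rechecks=3` and a one-minute check interval, a single transient timeout is logged but never paged, while a real outage pages within roughly three minutes; tuning those two numbers is exactly the sensitivity trade-off described above.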
Longer term we are considering adding individual monitoring of additional internal services in order to get a clearer picture of arising issues, ideally before they become actual problems. These measures will require coordination between Runbox staff and our system administration team at Copyleft Solutions.
We regret that our communication to our customers during this incident fell short of what was adequate and did not match the seriousness of the situation.
Although we have already established communication lines for incidents, we are now expanding these with improved communication procedures for future incidents to ensure our teams are updating each other frequently. This in turn will allow us to update our customers better and more frequently as well, via our own status page, our community forum, and social media.
We are also establishing a new escalation procedure using several different communication channels with increasing alarm levels, integrating email, IRC, and various smartphone apps in addition to text messages and phone calls.
Again, we thank you for your continued support while we implement these improvements.
Thanks for the helpful feedback Geir.
Thank you for this incident report, Geir, and the measures that you’re taking to see improvements.
Once again, nobody expects perfect uptime, but we do expect a perfect response to incidents, including fast, informative communications.
If we know that you know, then it eases concerns from the very start.
Is the IMAP outage that is occurring right now (April 22, very sporadic) included in the [Monitoring] blog post from April 16th? It’s not clear.
I encourage you to consider replacing the Status blog AND NodePing with one service that does both properly, such as StatusPage.io. Yes, there’s a cost to this, but it would be worth it!
The current Status blog is not done very well, and because it’s separate from the monitoring, it’s difficult to correlate status updates with status events over at NodePing. See my point? Bring them together under one roof.