IMAP and POP outage ⛔️ – Nov 2

TheDigitalOrchard · November 2, 2022, 3:29pm

Just opening up a discussion forum on the current outage. Eagerly awaiting updates.

TheDigitalOrchard · November 2, 2022, 3:37pm

Got 10 IMAP accounts set up in Mail.app. Of those, 6 are not authenticating.

Jdyton · November 2, 2022, 3:57pm

IMAP and POP both down in Northeastern USA.

nigeldalton · November 2, 2022, 4:44pm

It doesn’t look like the notifications and updates promised by Geir in the previous outage situation have been carried out. Most disappointing and makes me think I ought, reluctantly, to look for a more reliable provider.

Jdyton · November 2, 2022, 5:18pm

Seems to be more frequent problems. So far Runbox isn’t giving any updates. Beginning to wonder what is going on.

TheDigitalOrchard · November 2, 2022, 5:22pm

Agreed. This is industry-standard stuff that many companies do exceptionally well. Yet Runbox continues to let us down in this regard.

TheDigitalOrchard · November 2, 2022, 5:34pm

Update: only two of my accounts are now not authenticating, so this doesn’t appear to be a complete outage, but more likely a server load issue. It’s having trouble handling all inbound requests, therefore some are failing. I’ve noticed an improvement from 6 to 2, so that would make sense.

@Geir — Let’s get a proper response, please. It’s been going for 2 hours with nothing but silence from Runbox. That’s not acceptable.

nigeldalton · November 2, 2022, 5:38pm

My single account has just come back online. But I agree that not responding to customers is just disrespectful.

FredOnline · November 2, 2022, 5:46pm

If Runbox is ‘mission critical’ for you, it is perhaps better to contact them directly at support@runbox.com - who knows if someone in Support is actually monitoring this forum at any given time.

Jdyton · November 2, 2022, 5:58pm

Yes, my account came back online. I only have a single account as well.

TheDigitalOrchard · November 2, 2022, 6:15pm

@FredOnline

They have a Status page where they post network status updates, and NodePing that clearly shows an outage is occurring, along with Twitter and this Community forum.

Time and time again, an outage is occurring, NodePing is reflecting that, yet there’s not a single peep from Runbox anywhere. That’s the concern, and it’s played out over and over again, despite their promises for improvements.

Many companies execute network status updates perfectly. But Runbox continues to hide behind excuses of one form or another. Often that is “We only post something after our investigation”… but sorry, it shouldn’t take more than 2 hours to do an investigation when NodePing is clearly showing an issue that users are experiencing.

Still don’t know why Runbox can’t figure this very basic process out.

Outage begins, create a post with a status of “Investigating”.
Problem identified, post an update with a status of “Identified”.
Problem fixed, post an update with a status of “Resolved” and “Monitoring”.

Systems exist (e.g. StatusPage) that do this automatically, but they seem to refuse to move away from their confusing blog-style Status page. Is it a cost issue?
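
For illustration only, the three-step flow above could be sketched in Python roughly like this. The names (`Incident`, `Status`, `update`) are hypothetical and not part of any Runbox, NodePing, or StatusPage API, and treating “Monitoring” as its own stage before “Resolved” is an assumption on top of the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"   # assumption: a separate stage before "Resolved"
    RESOLVED = "Resolved"


# Statuses in the order they are expected to be posted.
ORDER = [Status.INVESTIGATING, Status.IDENTIFIED, Status.MONITORING, Status.RESOLVED]


@dataclass
class Incident:
    """Hypothetical incident record that opens with a public 'Investigating' post."""
    title: str
    updates: list = field(default_factory=list)

    def __post_init__(self):
        # The first public post goes out as soon as the incident is opened.
        self._post(Status.INVESTIGATING, "We are looking into reports of a problem.")

    @property
    def status(self):
        # The current status is the status of the most recent update.
        return self.updates[-1][1]

    def _post(self, status, message):
        self.updates.append((datetime.now(timezone.utc), status, message))

    def update(self, status, message):
        # Only allow moving forward; stages may be skipped (e.g. a false alarm).
        if ORDER.index(status) <= ORDER.index(self.status):
            raise ValueError(f"cannot go from {self.status.value} to {status.value}")
        self._post(status, message)


if __name__ == "__main__":
    incident = Incident("IMAP and POP login failures")
    incident.update(Status.IDENTIFIED, "Cause found; a fix is being applied.")
    incident.update(Status.MONITORING, "Fix applied; watching for recurrence.")
    incident.update(Status.RESOLVED, "Service is back to normal.")
    for when, status, message in incident.updates:
        print(when.isoformat(timespec="seconds"), status.value, "-", message)
```

The point of the sketch is only that the first “Investigating” post happens automatically when an incident is opened, rather than waiting for the investigation to finish.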

frombeyondspace · November 2, 2022, 6:22pm

Every time this happens I try to rationalise it as acceptable, in the sense that no service can stay up 100% of the time. It is, however, frustrating that they do not seem to improve their response to these events, and the previous outage still doesn’t have a post-mortem listed. It would be helpful for me as a customer to understand why these outages happen and how they are mitigated/managed going forward. Otherwise I feel uncertain whether I should continue being a customer… despite wanting to.

Jdyton · November 2, 2022, 7:20pm

They finally posted a short status report about intermittent IMAP and POP. NodePing shows POP offline, I guess that means they are working on it…

Geir · November 2, 2022, 7:41pm

We have indeed implemented many of the improvements mentioned previously, including utilizing dedicated apps to alert our team about incidents. We are investigating why this did not function as expected in this case as it appears no alerts were sent.

Our system administrators have been working to resolve the access issues since around 17:00 CEST, but we agree that this should have been communicated right away.

We appreciate your comments here and will continue making improvements to our incident response procedures.

In the meantime we apologize for the lack of communication about this incident.

– Geir

TheDigitalOrchard · November 2, 2022, 9:39pm

Thanks @Geir

To be painfully clear — we don’t expect perfect uptime, but we do deserve a perfect response to incidents. We all understand it can take time to resolve issues, but we need to know that you’re working on them — immediately. Even if it’s a false alarm in the end, some type of “Investigating” status report should be posted immediately.

These incidents often happen in the middle of North American work days (which is the end of Norway work days), so this needs to be factored in, too — where your users are located. Could it be a load issue as many more devices begin checking email again?

Can we get that less-than-useful Status blog replaced with something more official? Suggestion would be to subscribe to something like StatusPage, host it at status.runbox.com (outside of Runbox infrastructure) and sunset NodePing. Bring both the monitoring and updates together into one expected place.

Geir · November 5, 2022, 9:41pm

@TheDigitalOrchard We agree with this, and although we have improved our incident response over the past few months we are not satisfied with our response time in this particular case.

As you suggest, slowness or disruptions sometimes occur when European and US business hours overlap, which is the time of day our service experiences the most traffic and consequently load.

As our service continues to grow we are evaluating improvements to our architecture beyond mere horizontal scaling in order to remove bottlenecks and prepare for additional customer growth.

Our team is normally spread across several time zones in order to better monitor our services, respond to incidents, and provide timely customer support. In this case there was an unusual gap in our coverage, which in combination with recent updates to our external alerting systems that inadvertently prevented the alerts from reaching our team, resulted in delayed internal and external communication.

As mentioned, our system administrators were aware of the situation and working to resolve it, and we regret that this was not communicated publicly. Following this incident we have further improved our inter-company communication lines to ensure that all necessary resources are responding.

We are also again evaluating services such as StatusPage and whether there are open source or self-hosted alternatives. In our experience NodePing provides the most reliable monitoring service so we are continuing to utilize that.

– Geir

TheDigitalOrchard · November 7, 2022, 4:34am

Thanks @Geir. I see that you’ve now replaced your Status page with a WordPress-based site combining NodePing and status updates. While I don’t think that WordPress is an ideal choice, at least you’ve hosted that site out of Texas and not part of your other Runbox infrastructure, so that’s good! Looking good and definitely moving in the right direction! Thanks for your hard work, and glad to hear that the pain is due to business growth success, not decline. Keep up the great work!

FredOnline · November 7, 2022, 10:46am

Geir wrote: “Our team is normally spread across several time zones in order to better monitor our services, respond to incidents, and provide timely customer support. In this case there was an unusual gap in our coverage, which in combination with recent updates to our external alerting systems that inadvertently prevented the alerts from reaching our team, resulted in delayed internal and external communication.”

Perhaps, in addition to your ‘team’, you could consider adding some of your customers in far-flung corners of the world as ‘priority contacts’, so that if you receive downtime reports from them, you can trust them and escalate the problem more quickly.

TheDigitalOrchard · November 7, 2022, 7:20pm

November 7th: there seem to be more “brief outages” occurring.

@Geir — The changes made to the NodePing page (“Detailed Service Status”) make it impossible to see specific details now. For some reason the “automated log” records are no longer visible. Any idea why those were removed? They are useful to see a more granular history of the status updates.

Geir · November 7, 2022, 10:56pm

@TheDigitalOrchard and @FredOnline,

Thanks for the feedback and support!

As you can tell, we are experimenting with a new format for the status.runbox.com page, taking into consideration transparency and GDPR-related concerns (which is also the case with Atlassian’s StatusPage service).

We can certainly include more details from NodePing again as the main reason they were removed was to leave space for event updates. Would you prefer event updates above or below the integrated service status from NodePing?

The suggestion to add customers from various time zones to a trusted group that can help monitor services might be a good one, and one we have considered in the past. This would not be a solution to incident responses in itself, but could be beneficial to both our service and our customers at least in a transition period.

Any volunteers?

– Geir