IMAP and POP outage ⛔️ – Nov 2

nigeldalton · November 2, 2022, 5:38pm

My single account has just come back online. But I agree that not responding to customers is just disrespectful.

FredOnline · November 2, 2022, 5:46pm

If Runbox is ‘mission critical’ for you, it is perhaps better to contact them directly at support@runbox.com - who knows if someone in Support is actually monitoring this forum at any given time.

Jdyton · November 2, 2022, 5:58pm

Yes my account came back online. Only have single account as well.

TheDigitalOrchard · November 2, 2022, 6:15pm

@FredOnline

They have a Status page where they post network status updates, and NodePing that clearly shows an outage is occurring, along with Twitter and this Community forum.

Time and time again, an outage is occurring, NodePing is reflecting that, yet there’s not a single peep from Runbox anywhere. That’s the concern, and it’s played out over and over again, despite their promises for improvements.

Many companies execute network status updates perfectly. But Runbox continues to hide behind excuses of one form or another. Often that is “We only post something after our investigation”… but sorry, it shouldn’t take more than 2 hours to do an investigation when NodePing is clearly showing an issue that users are experiencing.

Still don’t know why Runbox can’t figure this very basic process out.

Outage begins, create a post with a status of “Investigating”.
Problem identified, post an update with a status of “Identified”.
Problem fixed, post an update with a status of “Resolved” and “Monitoring”.
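
To make the three steps above concrete, here is a minimal sketch of that lifecycle as a small state machine. It is only an illustration, not how Runbox or any status service actually works: the `Incident` class, the `ALLOWED` transition table, and the choice to treat “Monitoring” as a step before “Resolved” are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


# Hypothetical transition table: which statuses may follow which.
# Investigating -> Resolved is allowed so a false alarm can be closed directly.
ALLOWED = {
    Status.INVESTIGATING: {Status.IDENTIFIED, Status.MONITORING, Status.RESOLVED},
    Status.IDENTIFIED: {Status.MONITORING, Status.RESOLVED},
    Status.MONITORING: {Status.INVESTIGATING, Status.RESOLVED},
    Status.RESOLVED: set(),
}


@dataclass
class Incident:
    title: str
    updates: list = field(default_factory=list)

    def __post_init__(self):
        # An incident is public as "Investigating" from the moment it is opened.
        self._post(Status.INVESTIGATING, "We are investigating reports of an issue.")

    @property
    def status(self):
        # The current status is the status of the most recent update.
        return self.updates[-1][1]

    def _post(self, status, message):
        # Store every update with a UTC timestamp so it can be shown in any zone.
        self.updates.append((datetime.now(timezone.utc), status, message))

    def update(self, status, message):
        # Reject transitions that are not in the table above.
        if status not in ALLOWED[self.status]:
            raise ValueError(f"cannot go from {self.status.value} to {status.value}")
        self._post(status, message)
```

Usage would look like `incident = Incident("IMAP and POP outage")` followed by `incident.update(Status.IDENTIFIED, "...")` and so on. The point of the sketch is that the first public update costs nothing: it is posted when the incident is created, before anyone knows the cause.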

Systems exist (e.g. StatusPage) that do this automatically, but they seem to refuse to move away from their confusing blog-style Status page. Is it a cost issue?

frombeyondspace · November 2, 2022, 6:22pm

Every time this happens I try to rationalise it as being acceptable in the sense that a service cannot stay up 100% of the time. It is however frustrating that they do not seem to improve on their response to these events and the previous outage still doesn’t list a post mortem. It would be helpful for me as a customer to understand why these outages happen and how they are mitigated/managed going forward. Otherwise I feel uncertain whether I should continue being a customer… despite wanting to.

Jdyton · November 2, 2022, 7:20pm

They finally posted a short status report about intermittent IMAP and POP. NodePing shows POP offline, I guess that means they are working on it…

Geir · November 2, 2022, 7:41pm

We have indeed implemented many of the improvements mentioned previously, including utilizing dedicated apps to alert our team about incidents. We are investigating why this did not function as expected in this case as it appears no alerts were sent.

Our system administrators have been working to resolve the access issues since around 17:00 CEST, but we agree that this should have been communicated right away.

We appreciate your comments here and will continue making improvements to our incident response procedures.

In the meantime we apologize for the lack of communication about this incident.

– Geir

TheDigitalOrchard · November 2, 2022, 9:39pm

Thanks @Geir

To be painfully clear — we don’t expect perfect uptime, but we do deserve a perfect response to incidents. We all understand it can take time to resolve issues, but we need to know that you’re working on them — immediately. Even if it’s a false alarm in the end, some type of “Investigating” status report should be posted immediately.

These incidents often happen in the middle of North American work days (which is the end of Norway work days), so this needs to be factored in, too — where your users are located. Could it be a load issue as many more devices begin checking email again?

Can we get that less-than-useful Status blog replaced with something more official? Suggestion would be to subscribe to something like StatusPage, host it at status.runbox.com (outside of Runbox infrastructure) and sunset NodePing. Bring both the monitoring and updates together into one expected place.

Geir · November 5, 2022, 9:41pm

@TheDigitalOrchard We agree with this, and although we have improved our incident response over the past few months we are not satisfied with our response time in this particular case.

As you suggest, slowness or disruptions sometimes occur when European and US business hours overlap, which is the time of day our service experiences the most traffic and consequently load.

As our service continues to grow we are evaluating improvements to our architecture beyond mere horizontal scaling in order to remove bottlenecks and prepare for additional customer growth.

Our team is normally spread across several time zones in order to better monitor our services, respond to incidents, and provide timely customer support. In this case there was an unusual gap in our coverage, which in combination with recent updates to our external alerting systems that inadvertently prevented the alerts from reaching our team, resulted in delayed internal and external communication.

As mentioned our system administrators were aware of the situation and working to resolve it, and we regret that this was not communicated publicly. Following this incident we have further improved our inter-company communication lines to ensure that all necessary resources are responding.

We are also again evaluating services such as StatusPage and whether there are open source or self-hosted alternatives. In our experience NodePing provides the most reliable monitoring service so we are continuing to utilize that.

– Geir

TheDigitalOrchard · November 7, 2022, 4:34am

Thanks @Geir. I see that you’ve now replaced your Status page with a WordPress-based site combining NodePing and status updates. While I don’t think that WordPress is an ideal choice, at least you’ve hosted that site out of Texas and not part of your other Runbox infrastructure, so that’s good! Looking good and definitely moving in the right direction! Thanks for your hard work, and glad to hear that the pain is due to business growth success, not decline. Keep up the great work!

FredOnline · November 7, 2022, 10:46am

Our team is normally spread across several time zones in order to better monitor our services, respond to incidents, and provide timely customer support. In this case there was an unusual gap in our coverage, which in combination with recent updates to our external alerting systems that inadvertently prevented the alerts from reaching our team, resulted in delayed internal and external communication.

Perhaps, in addition to your ‘team’, you could consider adding some of your customers in far-flung corners of the world as ‘priority contacts’, so that if you receive downtime reports from them, you can trust them and escalate the problem more quickly.

TheDigitalOrchard · November 7, 2022, 7:20pm

November 7th: there seem to be more “brief outages” occurring.

@Geir — The changes made to the NodePing page (“Detailed Service Status”) make it impossible to see specific details now. For some reason the “automated log” records are no longer visible. Any idea why those were removed? They are useful to see a more granular history of the status updates.

Geir · November 7, 2022, 10:56pm

@TheDigitalOrchard and @FredOnline,

Thanks for the feedback and support!

As you can tell we are experimenting with a new format for the status.runbox.com page, taking into consideration transparency and GDPR related concerns (which is also the case with Atlassian’s StatusPage service).

We can certainly include more details from NodePing again as the main reason they were removed was to leave space for event updates. Would you prefer event updates above or below the integrated service status from NodePing?

The suggestion to add customers from various time zones to a trusted group that can help monitor services might be a good one, and one we have considered in the past. This would not be a solution to incident responses in itself, but could be beneficial to both our service and our customers at least in a transition period.

Any volunteers?

– Geir

TheDigitalOrchard · November 7, 2022, 11:20pm

Volunteer — Ted @ The Digital Orchard
West Coast of Canada, Pacific Time Zone (PST/PDT)

Further suggestions:

status.runbox.com — Looks good with just the NodePing table and manual updates, but suggest that DATE and TIME be very clear. Currently, a date is missing, or scattered throughout the incremental updates, which is confusing. Using a timezone of CET is fine, but lacks context for those in other timezones, so a relative timestamp can be useful (eg. “5 minutes ago”, “3 hours ago”, etc.). WordPress likely has plugins to support this. Presently, those of us in other timezones would need to manually translate the times posted to understand when an update was made, and that’s not helpful.
Detailed System Status (NodePing page) — Display the automated log updates here to explain why the uptime of a given service is less than the industry-standard 99.9% goal. This was useful to provide details before, during and after your manual updates.
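
As a rough illustration of the timestamp suggestion above, here is a small sketch that turns one stored UTC time into a relative label and into local times for several zones. It uses only the Python standard library. The function names `relative_time` and `in_zones` are made up for the example, and this says nothing about what WordPress or its plugins can do.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def relative_time(then, now):
    """Return a label such as '5 minutes ago' for two timezone-aware datetimes."""
    seconds = int((now - then).total_seconds())
    if seconds < 60:
        return "just now"
    # Try the largest unit first so 3 hours is not reported as 180 minutes.
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        if seconds >= size:
            count = seconds // size
            return f"{count} {unit}{'' if count == 1 else 's'} ago"


def in_zones(moment, zones):
    """Render one timezone-aware moment in each named IANA time zone."""
    return {
        name: moment.astimezone(ZoneInfo(name)).strftime("%Y-%m-%d %H:%M %Z")
        for name in zones
    }


# Example: one update time shown for readers in Norway and on the Canadian west coast.
posted = datetime(2022, 11, 7, 19, 20, tzinfo=timezone.utc)
print(relative_time(posted, datetime.now(timezone.utc)))
print(in_zones(posted, ["Europe/Oslo", "America/Vancouver"]))
```

The design choice here is to store a single UTC timestamp with the full date for each update and derive every displayed form from it, so nobody has to translate times by hand.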

TheDigitalOrchard · November 7, 2022, 11:37pm

I just want to provide some context for my own postings here, and on Twitter.

I hold Runbox to a very high standard for one very simple reason — Runbox is the chosen email provider for my businesses (3 of them!) and our clients’ businesses (several!). While the service itself lacks a proper “Reseller/Tenant” setup and business-oriented configuration options, your own website advertises that the service is designed to support businesses.

So when an outage happens, it’s not just affecting one less-than-critical personal account that we may have. It has a broad impact across many businesses all at once, and we get an influx of questions from our customers about why their email is not working. When we don’t get timely updates from Runbox, we don’t know what to tell our customers. It’s that simple.

We are not reselling Runbox, but we are utilizing Runbox as your own website advertises — to provide business-class email services. Our clients are not technically-minded. They rely on us to set up their email services, and Runbox fits the bill for the most part. We just need a higher standard for how outages are handled — and you’re doing a great job at improving that. Thanks!

Geir · November 9, 2022, 8:03am

Thanks for the feedback @TheDigitalOrchard – we are continuing to make adjustments to the status page based on your suggestions:

https://status.runbox.com/

We completely understand your comments from a business perspective and sincerely appreciate you choosing Runbox for your own and your customers’ email service needs.

We provide services to many small businesses like yours for whom email is a critical service. This is a market segment we are looking to expand further into, and precisely for that reason your constructive criticism is all the more important to us.

We are evaluating various alternatives for closer communication with regards to monitoring and responding to incidents, although our overarching goal is to minimize the need for such responses altogether.

Thanks again for your continued support from our entire team.

– Geir

Geir · November 11, 2022, 7:31am

We have made some minor additional adjustments to the Service Status page, including the addition of the US EST time zone in the most recent post.

We’ve attempted to find a WordPress plugin that will convert timestamps into other time zones but this does not appear to exist, so we will have to manually include other time zones in each update.

– Geir

FredOnline · November 11, 2022, 12:50pm

Perhaps also consider having the links in the standard blue color - having them in black makes it difficult to tell what is a header and what could be a link.

Geir · November 14, 2022, 11:37pm

@FredOnline Agreed – now done!

– Geir

mangoman · November 17, 2022, 10:04pm

POP access is down.

NOV 18th 2022.