Any ETA for IMAP restoration?

bearklaw · February 5, 2022, 5:54am

IMAP has been down for 7 or 8 hours now, and that is confirmed by the status page at NodePing - Monitoring. Any ETA for getting it back up and running?

Rafal · February 5, 2022, 8:44am

It would be good to update status.runbox.com with additional information on the ETA. Such a long outage is very unusual.

Peter · February 5, 2022, 9:11am

Yes, an update, even a small one like “we’re working on it” would be good. Psychologically it would make all the difference.

nigeldalton · February 5, 2022, 10:08am

I’m afraid they’re not on the ball today. Must be a big problem for this to go on so long.

Geir · February 5, 2022, 10:25am

We apologize for the downtime and our system administrators are working to restore IMAP and POP services.

– Geir

Peter · February 5, 2022, 10:31am

It took almost 12 hours to acknowledge that there is a problem. Why did it take so long? Handling incidents like this doesn’t increase the trust we have in Runbox…

Rafal · February 5, 2022, 10:31am

I just saw that your status page has been updated to say “Some users” have problems accessing IMAP. However, according to the NodePing page the IMAP service has been out since midnight, which sounds like no one has been able to access it. Which is the case? Please be clear, timely, and truthful with status updates; trust is a precious commodity. Why is it taking so long? What was the cause and how will you avoid it in the future?

I have just moved from Tuffmail to Runbox, and I never experienced such a long outage in over 10 years with Tuffmail. Even minor ones were clearly disclosed. I look forward to many years with Runbox if trust is maintained.

Good luck with the technical fixes, but please also fix the trust with your customers.

Peter · February 5, 2022, 10:33am

I fully agree with what @Rafal has just posted.

Geir · February 7, 2022, 11:15pm

Thank you all for posting your comments here.

At the time our status page was updated it was not entirely clear how many users were affected. The NodePing check is routed to one of many IMAP/POP servers and does not necessarily indicate the extent of the outage.
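To illustrate the point about a single check not showing the extent of an outage, here is a minimal sketch that probes every backend individually and summarizes the result. It is not Runbox's or NodePing's actual setup: the host names, the function names, and the use of a plain TCP connect on port 993 are all assumptions, and a TCP connect is a weaker test than a real IMAP login.

```python
import socket
from typing import Callable, Dict, Iterable


def probe_tcp(host: str, port: int = 993, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: reachability of the IMAPS port stands in for service health.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_backends(hosts: Iterable[str],
                   probe: Callable[[str], bool] = probe_tcp) -> Dict[str, bool]:
    """Probe each backend by name instead of one load-balanced address."""
    return {host: probe(host) for host in hosts}


def summarize(results: Dict[str, bool]) -> str:
    """Describe how widespread an outage is from per-backend results."""
    if not results:
        return "no backends checked"
    down = sorted(host for host, ok in results.items() if not ok)
    if not down:
        return "all backends up"
    if len(down) == len(results):
        return "all backends down"
    return f"{len(down)} of {len(results)} backends down: {', '.join(down)}"
```

Calling `summarize(check_backends([...]))` with the hypothetical backend names would let a status page distinguish "some users affected" from "everyone affected".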

The underlying cause was related to a deployment of database logging of authentication attempts a few days earlier that caused increasing delays in responses to authentication requests.

Since this was uncovered we have been doing three things in cooperation with our server management team:

1. Rolled back the authentication logging to resolve the immediate IMAP/POP related outage, awaiting investigation of the authentication database performance related to MySQL read/write capacity.
2. Continued our analysis of the underlying causes of intermittent IMAP/POP access issues that masked alerts about the above and that contributed to our late response. This includes several avenues relating to ZFS filesystem configurations, encryption, and disk I/O on our storage servers as well as the resource usage of our IMAP/POP (Dovecot) installation.
3. Reviewing our service monitoring systems and incident procedures in order to improve our response times and communication routines.
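As a general illustration of why logging in the authentication path can delay responses, and one common way to decouple the two, here is a minimal sketch of a non-blocking logger. It is not a description of the Runbox implementation or of the fix that was applied: the class name `AsyncAuthLogger`, the `sink` callback (which would stand in for a database write), and the drop-when-full policy are all assumptions.

```python
import queue
import threading
from typing import Callable, Tuple

AuthEvent = Tuple[str, bool]  # (username, success); hypothetical event shape


class AsyncAuthLogger:
    """Queue auth events and write them from a background thread."""

    def __init__(self, sink: Callable[[AuthEvent], None], max_pending: int = 1000):
        self._sink = sink  # e.g. a function that inserts a row (assumed)
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # events discarded because the queue was full
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def log(self, username: str, success: bool) -> bool:
        """Enqueue without blocking; return False if the event was dropped."""
        try:
            self._queue.put_nowait((username, success))
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._sink(event)
            except Exception:
                pass  # a failing sink must not stop the worker
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued event has been handed to the sink."""
        self._queue.join()
```

With this shape, a slow sink costs dropped log entries rather than slower login responses. Whether that trade-off is acceptable depends on how the logs are used.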

After having been in this business since 2000 we understand well that our relationship with our customers is based on trust, and that this is something that must be continuously earned.

We appreciate your continued support and will do our utmost to deserve your trust going forward.

– Geir

Rafal · February 8, 2022, 7:47am

Thank you, @Geir, I appreciate your work and the reply.

Peter · February 8, 2022, 8:13am

Thank you, @Geir, for the explanation.

FredOnline · February 8, 2022, 11:17am

Good answer!

My experience with Runbox over many years is that they provide excellent support, and that’s why the failure to update customers at the time of the outage was particularly worrying. I would like to think that the majority of your customers are understanding when things go wrong, provided we are kept in the loop and we know that you are actually working on resolving any problems. The continued silence over many hours was very disappointing, as I know you can do better.

bearklaw · February 8, 2022, 6:35pm

Thanks @Geir, I appreciate the detailed reply. I agree with @FredOnline - I trust Runbox, which is why the lack of information seems so disconcerting. Thanks for sharing the details of the problem.

coolfactor · February 9, 2022, 7:48am

Appreciate that you’re making a concerted effort to improve the infrastructure. I encourage you to look beyond traditional approaches. For example, is MySQL needed in order to store authentication logging? Could a much more performant type of database work better? How about a super-fast key-value database instead of a relational database?

I have said it for years … distribution of resources across dedicated servers is always going to give you the best performance and options for management, but this feedback was always knocked down as though your team knows better than a customer. But now you’re in the process of setting up dedicated processes to handle things like authentication. So while you’re at it, I encourage you to really look hard at the wealth of technologies that we have today. For example, I’ve found MongoDB performance way better than MySQL. Both are fast, but MongoDB has just been that much better. Use cases will vary, though.

Keep up the good work. To be a leading-edge provider, don’t look backwards. Break the form, much like you’re trying to do with RB7. Long journey with that one, though. Don’t forget about us IMAP users while you’re at it.