Another Outage

nigeldalton · November 10, 2022, 3:21pm

Runbox down again - this is becoming really annoying and I want to know what specific steps are being taken to keep your Customers on line because other providers, as far as I can see, don’t appear to have these problems.
I look forward to something positive from you.

Geir · November 12, 2022, 7:53am

Hello @nigeldalton and thanks for your message.

We understand the frustration and can assure you that we are working to resolve the problem, which has proven to be more complex than we could have anticipated.

In short, the issue appears to be caused by a small piece of software deep in our service’s architecture that is responsible for patching requests between our IMAP/POP services and our authentication service.

We are now working to redesign and rewrite this software to make it run more efficiently and thereby avoid the intermittent timeouts we are seeing during peak traffic hours.

We will not stop until this problem is resolved, and we might also add more servers to alleviate the problem that way.

Thanks for your continued support while we work to resolve this problem.

– Geir

nigeldalton · November 12, 2022, 11:11am

Thank you for the response Geir, I am somewhat relieved that you have identified the problem and that your team is working to fix it.

nigeldalton · November 17, 2022, 2:28pm

… but it’s down again. Sorry to keep harping on - I’m sure you’re as frustrated by it as your Customers are.

mangoman · November 17, 2022, 10:02pm

Runbox POP access has been down for at least 5 hours.
What is going on this time?

nigeldalton · November 18, 2022, 11:21am

It was back up, but it has gone down again now!

Rafal · November 18, 2022, 11:42am

IMAP is down again, just at the busiest time when we need it the most. Too many too long outages way too often. This is not a professional level of service.

I am starting to fear I will have to find a more reliable provider, such a pity Runbox cannot solve this issue.

nigeldalton · November 22, 2022, 3:29pm

We’re down again. How much time does it take to fix a problem like this if it’s been identified please?

TheDigitalOrchard · November 22, 2022, 4:43pm

And worse than that, they haven’t provided a timely update! After they promised that they would.

What the heck is going on over there @Geir?

TheDigitalOrchard · November 22, 2022, 4:54pm

There’s a brief update posted under the existing status blog post:

[Monitoring] IMAP and POP access issues

UPDATE 2022-11-22 16:00 CET (10:00 AM EST): We are continuing analyses of timeout errors and considering options for improved balancing of server load related to IMAP/POP proxy connections.

This is not how it should be done. This is a new outage, so it deserves a new post. At the very least, figure out a way to put a date into the topic headline so that we can see immediately that you’re aware of this outage. Don’t make your customers dig for information.

TheDigitalOrchard · November 22, 2022, 5:33pm

@Geir — Help us out, here. We’re in the middle of our morning on the West Coast of North America… and we are without our email. The NodePing page is not being very helpful, and once again, no manual updates from Runbox. Are you working on this outage? What is the problem? What is the progress being made? These all need to be communicated frequently… like every 20 minutes or so… to give customers confidence.

matthew · November 22, 2022, 5:38pm

Anything? It’s been 20 days since these issues started, and I’m getting frustrated with the lack of updates as a customer.

I’ve been a happy Runbox customer for the past couple of years, but I need my email working. I would rather avoid finding another provider.

Geir · November 22, 2022, 5:46pm

@TheDigitalOrchard Yes, our system administrators are actively working on this problem along three dimensions:

1. Investigating authentication-related bottlenecks which may indirectly be affecting the entire service.
2. Moving IMAP/POP pre-authentication services to separate, dedicated servers to offload IMAP/POP proxy servers.
3. Assessing the need for additional hardware in order to scale the architecture horizontally.

We are fully aware of the pain this is causing you and other customers as our own team rely on the same services to operate, and we are engaging all available resources towards resolving the problem.

The reason we have decided to update the same status post is that the ongoing issue is the same as before, and the history is therefore relevant.

We will continue working to improve the frequency of updates as we receive details from our system administration team.

– Geir

TheDigitalOrchard · November 22, 2022, 5:54pm

Thanks @Geir.

Your intended solutions are exactly what I suggested years ago during past outages. I saw this coming, and yet my suggestions were continually swept aside as though Runbox knew best, and I was “just a customer”. It was frustrating. Seeing that you’re finally making these horizontal-load improvements shows that you’re learning from this. Thank you.

Now, I’m not suggesting that I’m an expert in every area, or pretend to know your deep inner workings, but here are some further ideas:

Place IMAP onto its own server infrastructure (imap.runbox.com)
Place POP3 onto its own server infrastructure (pop.runbox.com)
Place SMTP onto its own server infrastructure (smtp.runbox.com)

Maybe it’s not technically possible to split it up like this given how the mail software works, but if Runbox was truly trying to be a next-generation email provider, you’d consider all options. Move beyond how existing mail software works and come up with ingenious new solutions that remain compatible with mail clients.

The web-based Runbox7 is neat and all, but when basic underlying mail services are going down frequently, that should be getting the full attention, then build R7 on top of that improved infrastructure.

FredOnline · November 22, 2022, 5:56pm

The reason we have decided to update the same status post is that the ongoing issue is the same as before, and the history is therefore relevant.

I’ve raised this problem in another thread: a cursory glance makes one think there have been no problems since Nov 2nd. And no doubt it looks better that way. It’s only when you dig in that you find more information.

Geir · November 22, 2022, 6:28pm

@TheDigitalOrchard We have continued to scale our infrastructure when needed, while working to utilize available resources as efficiently as possible.

The Runbox service is distributed among at least 18-20 different virtualized service clusters (load balancing, IMAP/POP, MX, authentication, web, spam scanning, etc) and at this point it isn’t clear that the timeouts are caused by insufficient virtual or physical capacity.

Email systems are more complex than they might appear, posing a unique mix of challenges relating to reliability, traceability, and performance. Strategically, we attempt to utilize open source systems and standards to the maximum extent while implementing customizations that increase long-term performance and/or flexibility and allow us to develop Runbox further.

For instance, IMAP and POP are closely connected as they are both provided by (our customized version of) Dovecot, while both Runbox 6 and 7 are decoupled and access the central database and file storage independently (while in many other services webmail is simply an IMAP client).

We will resolve these problems, and our goal is to prepare and scale our infrastructure before any future performance-related issues arise.

– Geir

Geir · November 22, 2022, 6:30pm

@FredOnline The only reasons we have updated the same status post are efficiency and, as mentioned, that the same issues are ongoing.

We have nothing to hide, as demonstrated by the NodePing status having been integrated at the top of the main status page.

– Geir

nigeldalton · November 22, 2022, 6:38pm

I’m pleased I’m not technical like some guys on here seem to be. However, basic Customer Service in my book says that you need to say 1) we are aware of the problem, 2) we know what it is (if you do) and 3) we will have it fixed in 24 hours, 48 hours or truthfully whatever the date is. You just can’t expect clients to keep pestering for information.

Peter · November 22, 2022, 6:47pm

I think that’s reasonable. It would still be more user-friendly to show the last updated date of the post on the front page, though.

Geir · November 22, 2022, 7:04pm

We have now reversed the order of updates on the status post to make it clearer that we are working on the issue:

https://status.runbox.com/2022/11/02/imap-and-pop-access-issues/

– Geir