18 March 2021
We joined Mark Chamberlain to better understand what Platform Services means for the future of BT.
A huge milestone in our digital transformation has been reached. We have now successfully migrated our EE Digital infrastructure capability to a new Platform Services environment. This shift started many months ago and it revolutionises the way we create and manage technical capability that all our product teams can use.
To delve deeper into what has been going on behind the scenes to make this all possible, we caught up with Mark Chamberlain, DevOps Manager within the Platform Services team, to understand what his team has put into the migration and what this achievement means for the future of BT.
Can you talk us through what your team does and the journey you’ve been on up until now?
We’ve been called Platform Services for about 3 years now - I've lost track of time with last year to be honest - but we've always been responsible for the underlying infrastructure of ‘EE shop’ and ‘My EE Web’ and various other things as well, since about 2013 when we moved from on-premise hardware, into Amazon Cloud.
The original lift and shift took everything that was running on our on-prem kit - which was very, very expensive to maintain - into Amazon Cloud. Then in about 2015, it wasn’t working quite as expected, so we tried to break things out. The first attempt was to break out the EE Shop from the new cloud into another cloud and that went well, but again, there were a few little things that weren't quite right with that. So, then we formed this new team called ‘Platform Services’, which was effectively a ground-up rebuild. It had all the latest and greatest technologies and we went live in 2019 with the ‘EE Coverage Checker’, which was the first live service in production that we had on our brand-new architecture and infrastructure.
Then around 14 months ago we said, ‘Right. Now is the time to break out all the big guns (i.e. My EE Web and EE Shop), which are our real money spinners, and get those into the new Platform Services architecture’, because the architecture and systems that they were on were very, very expensive and not very flexible. We’re now trying to empower squads and tribes themselves to just be able to log onto the site and basically spin up their own environment. If they want to test something, they can now do it within minutes. I think it takes about 20 minutes to spin up the entire EE Shop, for example, under their own name. It doesn't impact production; it doesn't impact anybody else. They've got their own play area to play with if they want to develop something.
We're empowering people to use their own environments and be responsible for them as well, so we can bill each tribe for whatever it's using. So yeah, we’ve broken it out a lot more. The systems they’re running on now are far faster, far more flexible and ultimately cheaper as well. So, all in all, we're using the latest and greatest technologies, giving people the ability to do this for themselves rather than relying on - if you like - ‘infra geeks’ to do it for them.
We’ve been giving this to people now for a few months and what we’ve been hearing from the My EE Web team is: “this is just what we've been after for so many years!” and that's why we did it 14 months ago for this particular project. It’s why we designed it the way it is. Because over the course of the eight years since we went from on premise to cloud, there’s been a load of lessons learned and we knew what we really had to build to please the majority. So, we’re very pleased with the way it's gone. I think one of the biggest compliments we could get was from the leadership team, who didn't notice that we’d put this whole thing in because there's been no outages, there's been nothing since we went live last Sunday morning so it was a big relief to get in last week!
How has the pandemic affected the migration?
It's been really testing actually - we were meant to go live last April, originally. We built out the staging environment last April and then obviously COVID hit and we had a ‘code free’ period so we weren't allowed to do anything for a couple of months. And then you've got the iPhone launch and the company always wants stability round the time of the iPhone launch in September. And of course, there were multiple iPhone launches this year, so we’ve had multiple stability periods. Before you know it, you’re in October, November time and the pressure’s on! So, then you say ‘Right. Let's get this in before Christmas.’ So, we had a go around the 10th December and that was the first failed attempt. These things never work first time and if it had gone in first time it would’ve been a miracle! Obviously then there's our own furlough (not to be confused with COVID furloughs). So, we had a furlough period where our contractors don't work. And then before you know it, you're in between Christmas and January and it's like, ‘okay let's give it another go.’
I mean, you can see that it’s been challenging to manage. Luckily, our teams are co-located in Leeds, and we've also got a big partner over in Belarus and Ukraine who have been developing the EE Shop and My EE migration bits of it, whilst the systems support aspect is managed by another large partner over in Bangalore, India. From a pandemic point of view, because we work remotely with them anyway, it's not been that bad.
Having said that, just at the crucial point, round about mid-December/early January, we had four of the team off with COVID - two of whom were hospitalised, frighteningly. And of course that puts extra pressure on the rest of the team, who were still there. So yeah, the people who managed to get it in did such a super job because we've been under tremendous pressure to get it in. So yeah, we've been gagging to get it in basically!
How will this migration improve the way we work?
Teams now don't have to wait three weeks for us to create an environment. They can spin it up themselves, because we're giving them the power to go onto a website and fire up an EE Shop for themselves with no comeback on actual production. So, it's completely separated, and they can do what they like. They can start developing on that within a couple of hours rather than a couple of weeks, which was the case before. So, you’re saving so much time there. And we’re contributing to growth inasmuch as we’ll also be saving a lot of money, and inasmuch as the individual squads and tribes will all be developing under the same umbrella of technology: if you do something in one squad, using our model, and then you go to another, the code will be instantly recognisable because everyone should be using the same technology going forward.
Historically, we'd always be relying on individual servers, and it's like a real tangled web of technologies! Whereas with the headless stuff, ultimately all the product owner needs to know is one address: fire all requests at that and everything is abstracted beneath it. It's almost like an umbrella whereby you, as the product owner, only need to know one address and everything else is contained within that. So, it just abstracts things to a degree where it's just so much easier for product teams to develop for it, because everything else is done under the hood. You don't need to know load balancing, you don't need to know Hybris web servers or app servers or anything else like that. It's just ‘There you go. Point it at that and it will take care of everything else.’ So that's all being done under what are called ‘Kubernetes clusters’, which is basically taking individual applications and then making them virtual and putting them in little containers - it's called containerisation. You put them in these containers, making them all completely autonomous from each other, so you can mix and match. You just bundle them all together, wrap it in this Kubernetes cluster and then point whoever needs to go in it at the one address and then everything just works.
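As a rough illustration of the "one address" idea Mark describes, the sketch below models a single entry point that hides several containerised services and shares requests between their replicas. It is a toy model only: the class names (`Service`, `Gateway`), the path prefixes and the replica names are all hypothetical and are not taken from the EE platform or from any Kubernetes API.

```python
from itertools import cycle


class Service:
    """A hypothetical containerised service running as several identical replicas."""

    def __init__(self, name, replicas):
        self.name = name
        # Simple round-robin over replicas stands in for real load balancing.
        self._replicas = cycle(replicas)

    def pick_replica(self):
        return next(self._replicas)


class Gateway:
    """The single address callers use; routing happens out of their sight."""

    def __init__(self, routes):
        # Check longer prefixes first so the most specific route wins.
        self._routes = sorted(routes.items(), key=lambda kv: len(kv[0]), reverse=True)

    def route(self, path):
        for prefix, service in self._routes:
            if path.startswith(prefix):
                return service.pick_replica()
        raise LookupError(f"no service registered for {path!r}")


# Hypothetical wiring: two services behind one gateway.
gateway = Gateway({
    "/shop": Service("shop", ["shop-1", "shop-2"]),
    "/my-ee": Service("my-ee", ["my-ee-1"]),
})

print(gateway.route("/shop/phones"))  # shop-1
print(gateway.route("/shop/basket"))  # shop-2
print(gateway.route("/my-ee/bill"))   # my-ee-1
```

The caller only ever sees `gateway.route(...)`; which replica answers, and how many exist, can change without the caller knowing.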
Because the development of this Kubernetes cluster technology is supported by some of the biggest companies in the world, it will upscale, it will downscale all automatically. So, for example during an iPhone launch if things were getting a bit hot, historically the website might actually crash. Our technology now will allow it to just automatically scale up. It'll bring in new things so it will just take the load – we’ll never be in a position where we’re going ‘Oh my God, it's all falling down around us!’ And vice versa, during really quiet periods, it could be done so that it scales down as well.
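To make the scaling behaviour a little more concrete, here is a minimal sketch of the replica calculation that Kubernetes documents for its Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The interview does not say which autoscaling mechanism or thresholds the team uses, so the function name, the CPU figures and the minimum and maximum bounds below are purely illustrative assumptions, and the real autoscaler applies further refinements (such as a tolerance band) that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Scale the replica count in proportion to load, within fixed bounds.

    Follows the documented Horizontal Pod Autoscaler rule:
    desired = ceil(current_replicas * current_metric / target_metric).
    The bounds and metric values passed in are illustrative assumptions.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Never go below the floor or above the ceiling set for the service.
    return max(min_replicas, min(max_replicas, raw))


# Busy period: 4 replicas at 90% CPU against a 60% target -> scale up.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=20))    # 6

# Quiet period: 6 replicas at 10% CPU -> scale down, but not below the floor.
print(desired_replicas(6, 10, 60, min_replicas=2, max_replicas=20))    # 2

# Extreme spike: the result is capped at the configured maximum.
print(desired_replicas(10, 300, 60, min_replicas=2, max_replicas=20))  # 20
```

The same rule covers both directions Mark mentions: load above the target adds replicas, load below it removes them.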
What would you do differently if you were to do it again?
I would try and not take on so much so quickly. I would have probably tried to stagger it. We did discuss it at the time, but the problem was, due to technological reasons, we couldn't separate out the EE Shop and My EE Web. We had to go live with them on the same night and if there was any way we could have separated them out, that's what I would do differently.
What’s the next milestone for your team?
We've always said that once we got it in, we’d need to start breaking down some of the big, more monolithic services that we have. For example, we've got something called Hybris, which is an SAP product, that powers the shop, but it is a real monolithic service. We need to start breaking that down so that we don't have to do massive releases all at once. Because it would be great to just take a little bit of that; say, ‘you know what, we need to upgrade that tiny bit or this tiny bit’, but at the minute we have to upgrade the whole lock, stock and two smoking barrels of it! It’d be really, really good if we could just break it down into little services, and then we can upgrade what we need to.