Insights for Easing the Transition to the Cloud
House of Brick, the services division of OpsCompass, has a growing set of white papers, webcasts, and blog posts on our experience of replatforming to AWS. The pandemic has induced a substantial, incremental acceleration into the cloud, despite the innocent belief among those of us who observe cloud implementation most closely that no further acceleration was possible. Following re:Invent 2019, I blogged on the experience of Amazon.com moving from Oracle to AWS. The post is rich with insights for easing the transition to the cloud, especially on cost savings, project management, and organizational dynamics.
First, we’ll review the replatform of the Amazon.com data warehouse off Oracle. Then we’ll review Amazon.com’s successful unload of its ~7,500 non-bundled Oracle databases.
I suggest you organize three lunch and learns to play these sessions. As it happens, re:Invent has given what I consider to be a great gift: no fee, registration, or even login is needed to view these sessions.
Amazon.com’s Data Warehouse Replatform from Oracle to AWS Redshift
That’s what I think the session should have been titled to grab appropriate attention. Here’s the session’s actual title: ‘How Amazon leverages AWS to deliver analytics at enterprise scale.’ (Yawn.)
Amazon’s Oracle RDBMS data warehouse had several hundred petabytes of data, ~900K jobs, ~38K tables, and ~80K active users.
As Amazon.com is an AWS customer, they used the same APIs and documentation to pull this off as any other AWS customer would.
Amazon.com’s data warehouse operation had invested heavily in specialized hardware, routers, and switches. Data and compute were coupled. It became unreliable. They spent hundreds of hours moving data around, including through sharding, just to keep the system running. They couldn’t get hardware on demand, and Oracle licensing was expensive. They needed to find a better solution and quickly. Coming from a traditional RDBMS, they needed to evaluate big data use cases like those their customers were beginning to use.
They chose to move to the AWS solutions DynamoDB (NoSQL), Aurora, and Kinesis.
Loads were modified so that the legacy and new data warehouses ran in parallel for an extended period during the transition.
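The session did not describe how the loads were modified, so what follows is only a minimal sketch of the general idea, assuming each load batch is sent to both warehouses and row counts are compared afterward. The Warehouse class, the dual_load and mismatched_tables helpers, and the table names are all hypothetical; nothing here is Amazon.com's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Warehouse:
    """Hypothetical stand-in for a warehouse target; rows are kept in memory."""
    name: str
    tables: dict = field(default_factory=dict)

    def load(self, table, rows):
        # Append a batch of rows to the named table, creating it if needed.
        self.tables.setdefault(table, []).extend(rows)

    def row_count(self, table):
        return len(self.tables.get(table, []))


def dual_load(table, rows, legacy, new):
    """Send the same batch to both warehouses during the parallel-run period."""
    rows = list(rows)  # materialize once so both targets see identical data
    legacy.load(table, rows)
    new.load(table, rows)


def mismatched_tables(legacy, new):
    """Return the tables whose row counts differ between the two warehouses."""
    names = set(legacy.tables) | set(new.tables)
    return sorted(t for t in names if legacy.row_count(t) != new.row_count(t))


legacy = Warehouse("legacy")
new = Warehouse("new")
dual_load("orders", [{"id": 1}, {"id": 2}], legacy, new)
dual_load("customers", [{"id": 10}], legacy, new)
new.load("orders", [{"id": 3}])  # simulate drift: a batch that reached only one side
print(mismatched_tables(legacy, new))
```

A reconciliation check like mismatched_tables is the kind of signal a team might watch during a parallel run before trusting the new system for cut-over. A real check would likely compare more than row counts.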
While the project was daunting, it was also highly successful. They pulled it off in two years. No project-wide Gantt chart was used. I was particularly impressed by the organizational dynamics discussed in the session. They had 90% successful query conversion on the first pass. That left 10%, which was clearly the presenter’s favorite part of the talk: all the people who didn’t want to migrate, didn’t have time to migrate, or were looking for features not in the new system.
The coordinating team didn’t get escalation emails. Rather, they got brag emails. “Hey, VP! We’re done early! Come to our launch party.” One team said they couldn’t hit their deadline. When asked how much additional time they needed, they said a week. (Never mind that a quarter of buffer had been built into the plan.) That’s not how tech is supposed to work. So, the central team asked, “How did you do it? You had said XYZ would break you.” The answer: “Yeah, well, we solved it. Don’t worry about it. And we shared the solution with other teams. It’s working great!”
The presenter maintained that central IT used to be the bottleneck. It was always a lot of blood, sweat, tears, and risk. This time, Amazon teams troubleshot on their own without coming to the central guidance team, because AWS technology makes that possible. That was a complete revelation to the central team.
The bottom line message: be ready. “The cloud moves far more quickly than you can.”
This was a data warehouse ecosystem with 1,700 different teams publishing, 3,000 teams consuming, and 20,000 data sets in active use. Policies and controls were put in place six months before cut-over to prohibit the introduction of new workloads into the legacy system.
Amazon.com Unloads its ~7,500 Non-Bundled Oracle Databases
The story of how Amazon transitioned from Oracle Database to AWS tools was split across two sessions:
- DAT359 ‘How Amazon.com migrated its applications from Oracle to AWS databases’
- AMZ301 ‘Amazon.com: enterprise database migration at scale’
Paramount in the case studies was the experience of Amazon.com replatforming its ~7,500 Oracle databases not required by third-party applications to other AWS database technologies. (I use the term ‘bundled’ in this context to refer only to third-party application vendors that require an Oracle database. As such, I am not referring to bundling from a licensing perspective.) In August 2018, Amazon.com announced that this project was underway with a goal of completing the move by early 2020. Oracle’s Larry Ellison wished them luck, noting that they had previously attempted the Oracle database unload and failed, and how hard the work was. Doug Booth of AWS Professional Services, Principal Business Development Manager for AWS, described Amazon.com as Oracle’s largest customer many times over, which is certainly credible. Amazon was experiencing substantial challenges with scalability and uptime.
Fast forward to October 15, 2019. See the one-minute video documenting the last Oracle database shutdown and celebration in this project overview blog post.
- Cost savings: 90%
- Performance/throughput improvements: 40%
- Substantial uptime improvements
- Scale up/scale down (as opposed to scale up and stay scaled): introduced for the first time (think Black Friday and, now even more importantly, Prime Day)
In conversation after the third of these sessions, more information came to light. Upon co-presenter Thomas Park’s July 2016 arrival at Amazon.com, people sincerely told him he’d have to hire a thousand people to get it done. Rather, he went in search of hungry Amazon professionals to provide leadership in the effort. Oracle Database administrators expressed concerns for both project viability and their professional futures. Individual business units and workflows were given the architectural option to choose which of the six AWS database technologies (as well as other AWS technologies) would best suit their purposes. Keep in mind that this effort succeeded without any central technical triage or escalation team.
Now all of the former Oracle Database administrators are happily employed in and out of Amazon.com as transitional mentor architects and in similar professionally invigorating roles.
I asked co-presenter Thomas Park, Sr. Manager of Software Development for Amazon, if they were on an Unlimited License Agreement (ULA) with Oracle. “I can’t comment on Oracle licensing,” he said with a smile. I expected his answer, but I still had to ask. “Then let’s discuss something you can talk about,” I said. “What about refactoring to deal with Oracle’s supplied PL/SQL packages? For many enterprises, that’s the 800 lb. gorilla in the room of unloading Oracle Databases.” Thomas said that each of the workflows evaluated how to approach the business and functional need with available AWS database types and tools. Through that process, he said, the supplied packages issue became moot. This could give the impression of being both a non-intuitive and an unintentional circumvention of porting PL/SQL directly into PostgreSQL. But it could also be thought of as something other than the Minimum Viable Product approach that is so common and so successful in such refactoring.
With that, Amazon.com had moved away from Oracle’s multi-purpose RDBMS to multiple, largely single-purpose database technologies. This is the very AWS technology direction that Ellison bashed repeatedly in his Oracle OpenWorld 2019 keynote. I would think that if the massive Amazon.com were indeed driving square technology pegs into round holes, their attempt to spin the comparative merits of the solution would become obvious soon, if it weren’t already. Such transparency would only be increased by the architectural liberty that individual business units and workflows enjoy.
The last of the three Amazon.com get-off-Oracle sessions finished at 5:35 pm. Our group of ten or so who lingered, including the presenters, was still discussing it 40 minutes later. Presenters Doug and Thomas were even more lively and animated with attendees off stage. They were clearly genuinely interested in conference goers as individuals, as well as in their organizations’ specific challenges and opportunities. I appreciated this gift.
Does Amazon.com have a secret IT sauce to pull this off given their scale? That’s not the way the story reads, given the departmental architectural independence. If anything, mixing that independence with their scale, one could imagine an exponential rise in project risk. Instead, I’m inclined to think that we are looking at one of the world’s most intriguing master classes in IT organizational behavior and transformation. There appears to be a wealth of public-facing, approachable, detailed information on this remarkable accomplishment.
Don’t forget, we’re not talking about a reference customer with a carefully-maintained relationship. This is Amazon.com.
I haven’t asked presenter Doug Booth, who made the claim, how “Oracle customer” is defined.