Towards cloud-native Big Data Analytics: the role of Kubernetes.
In recent years, Big Data analysis has become a necessity for many enterprises, not only for scientific research; however, identifying the ideal architecture for large-scale data analysis is not a simple task: to date, no solution, open-source or otherwise, is perceived as the “de-facto standard” of the market, regardless of the specific use case.
On October 15th, 2013, the day the first version of Apache Hadoop 2 reached “General Availability”, the situation was very different: the challenge of Big Data Analytics seemed won; all that remained was to wait for large-scale adoption of the Apache Hadoop ecosystem, and extracting value from the analysis of huge amounts of data would no longer be a problem, thanks to the MapReduce programming paradigm. At that time, the conditions for rapid adoption of the Apache Foundation’s data architecture were very favorable: IT managers could count not only on the open-source version of a product considered complete, scalable and cheap, but also on multinational companies such as Cloudera, Hortonworks and MapR, which placed the Apache Hadoop ecosystem at the center of their business; these companies employed their technicians to evolve and “customize” the open-source product and to provide all the services and consultancy needed to simplify its adoption in the enterprise environment.
A little less than seven years have passed since those days, but much has happened in the meantime: the expected large-scale adoption of Apache Hadoop did not take place, despite the fact that version 3 was released at the end of 2017. The enterprise-class support has also changed substantially: only Cloudera Inc., which completed its merger with Hortonworks in January 2019, stayed alive, while in August 2019 MapR was acquired by HPE. In addition, architectures for large-scale data analysis alternative to the Apache Foundation’s approach are gathering a growing consensus among the main influencers in the sector: the Apache Hadoop ecosystem is no longer perceived as the infrastructure of choice to support Big Data analysis!
But how did we get to this “perception”, starting from the expectations of October 2013? Some observers argue that the Apache Hadoop ecosystem is simply no longer as “fashionable” as it was in 2013, but that it still represents an excellent solution for large-scale data analysis; that is why several large companies around the world still use it with excellent results. According to this school of thought, the large-scale diffusion of the Apache Foundation’s architecture probably never happened because not that many organizations had a real need to process “Big Data”! Others argue, however, that the expected growth in the adoption of the Apache Hadoop ecosystem has not occurred because this architecture has not kept its promise to scale simply, efficiently and, above all, at low cost; for these reasons, the Apache Foundation’s data architecture is now outdated, very close to being one of those legacy solutions that remain alive for a while only because huge investments have been made in the applications that use them, and that is destined to die in the medium term!
These are two sharply opposed positions, which suggest a “religious war” more than a rational analysis of the enabling architectures for Big Data analysis. From our point of view, some components of the Apache Hadoop ecosystem are “alive and well” (think of Spark and Presto, just to name a couple), but its overall architecture has a peculiarity that has significantly reduced its ability to scale easily and at a reasonable cost in many contexts: the core of Apache Hadoop is designed with a strong coupling of compute and storage components, with the result that, to analyze large amounts of data efficiently, the know-how of the Data Scientist must be complemented by professionals able to fine-tune the HDFS storage subsystem, on demand, so that it does not become a bottleneck for the entire analysis process. It is not so surprising, therefore, that only large organizations have been able to create the right context to use the Apache Hadoop ecosystem for scalable Big Data Analytics: to take advantage of the Apache Foundation’s data architecture on-premise, huge investments are required in a dedicated, monolithic cluster that hosts the architecture. Moreover, to guarantee results, the Data Scientist team must be supported by personnel with solid knowledge of HDFS and YARN internals throughout the infrastructure life cycle. Perhaps it is true that not all organizations have a real need to analyze Big Data, but it is also true that the complexity of managing the Apache Hadoop ecosystem makes it a data architecture for the few (…and the very rich).
A confirmation that the complexity of the Hadoop ecosystem has been a problem can be found by “reading” the architectural choices made by the Apache Foundation itself on its main second-generation components. Take Apache Spark: it is a distributed computing environment oriented to data analysis, designed to work not only within the Apache Hadoop ecosystem but also in distributed environments without HDFS and YARN, thanks to its ability to operate natively on Object Storage with an S3 interface and to use scheduling services other than YARN. It is certainly true that Apache Spark’s constantly growing adoption curve is mainly due to its in-memory analysis capabilities, which have multiplied its performance compared to Apache Hadoop MapReduce, even by a couple of orders of magnitude. But its independence from the core of the Hadoop ecosystem (HDFS and YARN) laid the groundwork for what is today the most typical “first adoption” scenario of enabling platforms for large-scale data analysis: the use of resources in the cloud. For all those companies that want to approach large-scale data analysis while avoiding large initial investments, the “light design” of Apache Spark makes it possible to purchase Object Storage in the cloud to host the data to be analyzed and to use a pool of computing resources, for example orchestrated by Kubernetes, to run large-scale data analysis workloads with Spark, as in the sketch below. Of course, this approach to large-scale data analysis can also be implemented efficiently on-premise, not just in the cloud. Thus, Spark’s independence from YARN and HDFS has made it a flexible, scalable, elastic and cost-effective platform for data analysis.
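As a minimal sketch of this pattern (not a production setup, and not tied to any specific deployment), the PySpark snippet below starts a session that uses Kubernetes as its cluster manager instead of YARN and reads Parquet data from an S3-compatible object store instead of HDFS; the API-server URL, container image, credentials and bucket path are placeholder assumptions.

```python
# Sketch: Spark without the Hadoop core (no YARN, no HDFS).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-without-hadoop-core")
    # Kubernetes as the cluster manager, in place of YARN (hypothetical API server).
    .master("k8s://https://my-k8s-apiserver:6443")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.x")
    .config("spark.executor.instances", "4")
    # S3-compatible Object Storage in place of HDFS (hypothetical endpoint and credentials).
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "EXAMPLE_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "EXAMPLE_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# A simple aggregation over Parquet files stored in a (hypothetical) bucket.
events = spark.read.parquet("s3a://analytics-bucket/events/")
events.groupBy("event_type").count().show()

spark.stop()
```

The same pool of Kubernetes workers can then be reused by other workloads when the Spark job is finished, which is exactly the kind of elasticity a dedicated Hadoop cluster struggles to offer.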
The partial scalability of access to storage and the excessive complexity of some of its core components were not the only reasons that limited the large-scale diffusion of the Apache Hadoop ecosystem: in recent years, Machine and Deep Learning technologies have become the main tools to get insights from large amounts of data, and the most interesting machine learning frameworks (scikit-learn, TensorFlow, PyTorch, Fast.ai, …), designed for Python-based development environments, are not yet integrated efficiently with the Apache Foundation’s data architecture. These frameworks may become integrated with it only when tools such as Apache Arrow, a platform for cross-language development of in-memory analytics, become the integration element between all the components of a data architecture.
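To give an idea of the role Arrow can play, the illustrative snippet below converts a pandas DataFrame into an Arrow Table, a language-independent columnar representation that Spark, Dask or any other Arrow-aware engine can consume, and round-trips it through Parquet; the column names and values are made up.

```python
# Sketch: Apache Arrow as a common in-memory columnar layer.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "sensor_id": [1, 2, 3],
    "reading": [0.12, 0.34, 0.56],
})

# Arrow Table: the same columnar buffers can be shared across languages
# (Python, Java/Scala, C++, ...) without costly serialization.
table = pa.Table.from_pandas(df)
pq.write_table(table, "readings.parquet")

# Reading the Parquet file back yields an Arrow Table, convertible to pandas.
round_trip = pq.read_table("readings.parquet").to_pandas()
print(round_trip)
```

Spark already relies on Arrow in places, for instance to speed up conversions between Spark and pandas DataFrames when Arrow-based execution is enabled, but a deep, Arrow-based integration of the Python ML frameworks with the rest of the stack is still a work in progress.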
Finally, it is reasonable to believe that the evolution of architectural paradigms following the container revolution has contributed to making the Hadoop ecosystem “unfashionable”. The strong coupling between compute, scheduling and storage resources makes applications developed for the Hadoop ecosystem not very cloud-native, a paradigm for which elasticity of development and deployment across different environments, with the consequent abstraction from details such as compute, storage and network, is a fundamental requirement.
In this scenario, what choices should be made to obtain an efficient, scalable, flexible and economical data architecture, capable of hosting the best components for all phases of the data life cycle? Our reference design calls for, first of all, the adoption of a Kubernetes cluster, configured so that it can benefit from an internal high-performance network (RDMA), to be used both for inter-pod communication and for access to distributed storage resources. The Kubernetes platform allows the deployment of a high-performance distributed Object Storage (based, for example, on MinIO), supports data analysis engines such as Spark, but also emerging solutions such as Dask, Ray or Vaex, integrates GPU computing with little effort, and is the ideal platform for end-to-end Data Science environments such as Kubeflow; the sketch below shows what this design can look like from the analyst’s point of view.
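As a purely illustrative sketch, the snippet below uses Dask to run a distributed aggregation over Parquet files stored in a MinIO bucket inside the cluster; the scheduler address, MinIO endpoint, credentials and bucket are hypothetical, and in a real deployment the Dask workers would typically be created as pods via the Dask Kubernetes operator or Helm chart.

```python
# Sketch: consuming a Kubernetes-hosted data platform (Dask + MinIO) from a notebook.
import dask.dataframe as dd
from dask.distributed import Client

# Connect to a Dask scheduler exposed as a Service in the cluster (hypothetical address).
client = Client("tcp://dask-scheduler.analytics.svc.cluster.local:8786")

# Read Parquet data straight from S3-compatible storage (MinIO) via s3fs.
df = dd.read_parquet(
    "s3://analytics-bucket/events/",
    storage_options={
        "key": "EXAMPLE_ACCESS_KEY",
        "secret": "EXAMPLE_SECRET_KEY",
        "client_kwargs": {
            "endpoint_url": "http://minio.analytics.svc.cluster.local:9000"
        },
    },
)

# A simple distributed aggregation; the heavy lifting runs in the worker pods.
print(df.groupby("event_type").size().compute())
```

The point of the sketch is the separation of concerns: storage, scheduling and compute are independent Kubernetes workloads that can be scaled (or replaced) separately, which is precisely what the monolithic Hadoop cluster made difficult.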
Next time we will explore Kubernetes as an enabling platform for large-scale Data Analysis. Stay tuned!
Useful links: