Implementing Cloud Computing for Data Analytics
Data analytics has two parts: the data, and the analytics performed on it. Data is varied. It is generated by internal sources (transactional applications, sensors, etc.) and/or obtained from external sources (data aggregators, DMPs, SaaS applications, etc.). As more internal applications move to the cloud, this internal data is increasingly already present on the cloud platform. Organizations spend less effort and need fewer people to manage data infrastructure in the cloud while getting better scalability, provided security challenges are addressed properly. For external sources, getting data into a hosted environment or a cloud environment is essentially the same in terms of complexity. It can be argued that getting data into the cloud is a little simpler because provisioning is simpler. Even though compute and storage are decoupled in the cloud, it is very easy to build a solution that uses scalable compute and storage to massage and munge data from different sources and combine it into usable stores. Teradata, a stalwart in data management, also offers AWS instances so that it can scale further.
On the analytics front, the key activities are data aggregation, machine learning, statistical analysis, and visualization. Training machine learning models efficiently depends on quick access to large compute capacity for short periods of time. Cloud computing is well equipped to provide capacity on those terms. Many cloud providers offer specific machine types with pre-installed libraries to speed up deployment. Further, services like AWS EMR take server provisioning out of the picture and let the data scientist focus on model training and development tasks.
However, deep learning based models have so far not been very successful in the cloud computing model, because two things have been lacking:
(a) Powerful GPU based compute infrastructure
(b) Distributed deep learning software libraries that take advantage of scaling in the cloud
At the recent AWS re:Invent, new and powerful GPU compute instances were announced, along with support for distributed deep learning libraries like MXNet and TensorFlow. Expect to see deep learning work move to the cloud for a number of companies.
In the visualization area, general web application build-outs are well understood. Serverless platforms like Amazon QuickSight are emerging and could bring a round of innovation. In summary, there are a lot of reasons to be excited about the use of cloud for data analytics!
Companies face different types of challenges while deploying data analytics solutions. The first challenge is cultural. A culture of data driven decision making is essential to the success of a data analytics initiative. Many organizations use data only if it supports their gut or intuition. If it doesn’t, then insight from analytics is either ignored or belittled. Companies need to confront this critical question before they embark on their data analytics journey. Organizations must also clearly define the expected outcomes of a data analytics project. Is it speed, quality, or availability that they are after? It is often impossible to achieve all three at the same time, yet all three are often attempted at once, leading to missed expectations. It is important to start small and make progress. Often, two small consecutive steps have a multiplicative effect rather than an additive effect.
The second challenge is spending too much time building infrastructure, often a data warehouse or a data lake, and not delivering value quickly. Data warehousing projects often spend a lot of time designing the right schemas and ETL to get the data in one place. By the time these projects are completed (2-3 years), the business applications have changed. It is better to think in terms of ELT (extract-load-transform): load the data in its raw form and apply the schema on read. Starting with one line of business, delivering analytics and insights, and slowly expanding to other lines of business can be a more sustainable approach.
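To make the schema-on-read idea concrete, here is a minimal sketch in Python. It is an illustration only, not a reference implementation: the sample order records, the field names, and the helper names (RAW_EVENTS, ORDER_SCHEMA, read_with_schema, revenue_by_region) are all hypothetical. Raw records are stored exactly as they arrive, and types and defaults are applied only when a question is asked of the data.

```python
import json
from datetime import datetime

# Hypothetical raw events, landed as-is (the "E" and "L" of ELT).
# No schema is enforced when the data is written.
RAW_EVENTS = [
    '{"order_id": "1", "amount": "19.99", "ts": "2017-01-05T10:00:00", "region": "EU"}',
    '{"order_id": "2", "amount": "5.00", "ts": "2017-01-05T11:30:00"}',
    '{"order_id": "3", "amount": "12.50", "ts": "2017-01-06T09:15:00", "region": "US", "coupon": "X1"}',
]

# Assumed schema, applied at read time: field name -> (converter, default).
ORDER_SCHEMA = {
    "order_id": (int, None),
    "amount": (float, 0.0),
    "ts": (datetime.fromisoformat, None),
    "region": (str, "UNKNOWN"),
}


def read_with_schema(raw_lines, schema):
    """Parse raw records and project them onto the schema as they are read."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record get the default; extra fields are ignored.
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


def revenue_by_region(raw_lines):
    """The "T" of ELT: a transformation run on demand over the raw store."""
    totals = {}
    for row in read_with_schema(raw_lines, ORDER_SCHEMA):
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    print(revenue_by_region(RAW_EVENTS))
```

Because the schema lives with the query rather than with the store, a source application that adds a field (like the coupon above) or omits one does not break loading, and a new line of business can define its own schema over the same raw data.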
The third challenge, seen more recently, is hiring data scientists without having a clear outcome in mind. To get value from data scientists, one must start with a set of hypotheses that they can prove or disprove. These hypotheses should be generated by people with good domain knowledge. Generally, data scientists are not well versed in the business or domain. A structure that fosters collaboration between data scientists and business operations and strategy people is very important for the success of these initiatives.
The CIO as the “Strategy Officer”
For technology executives, it is important to consider the differences between the roles of CIO, CTO and CDO. We are seeing the emergence of the Chief Data Officer (CDO) as the importance of data is more widely recognized. More and more business operations are being driven by technology. Consider the impact of IoT on manufacturing businesses: smart plant initiatives are likely to become mainstream in a few years. In the marketing world, a lot of technology related spending is being done by the CMO. Marketing moves at a different speed than operations and manufacturing. This often creates a mismatch in expectations and leads to silos and fragmentation of the enterprise architecture.
CIOs need to prepare their organizational structure so that it can move at different speeds for different needs. Marketing needs fast response times, and time to market is critical. Some infrastructure projects, by contrast, must proceed with a lot of thought and consideration. Managing an organization where different projects and people move at different speeds is critical.
CIOs have a big role to play in driving innovation in the enterprise. It requires recasting their relationship with the lines of business as that of a partner rather than a service provider or cost center. CIOs must consider initiatives, in partnership with the lines of business, that try new ways of working, run targeted proofs of concept, and investigate new technologies.
Finally, CIOs need to consider how they approach the data asset. Thinking must evolve to treat data as an asset rather than a liability. Once this way of thinking is adopted, the question CIOs must answer is how they are going to extract value from this data asset. Can it be used to create new sources of revenue? Can it be used in new ways to improve business operations? In short, finding ways to impact the top and bottom lines using data assets is the CIO's imperative and opportunity.