SAP HANA Cloud Data Lake and Multi-Cloud Support for Google Cloud Platform – Part 3
Prompted by a customer query, I allocated some time to understanding the integration between SAP HANA Cloud and Google Cloud Platform, and I am pleased to share my findings. This blog builds on the series I have already published on SAP HANA Cloud and intend to develop further.
In my first blog, I shared my initial impressions of SAP HANA Cloud and catalogued my adoption of the HDI development paradigm. This sets out the new tooling and capabilities of this platform.
In the follow-up blog, I made good on my promise to provide a similar appraisal of the Data Lake capabilities. It provided an appreciation of the architectural underpinnings of SAP HANA Cloud Data Lake and defined its relationship with SAP HANA Cloud. The blog steps through a scenario involving data loads from AWS S3 via the CLI.
In my last blog, I shared a custom solution to support the mass load of CSV files. The LOAD TABLE statement is unable to cycle through a large number of files (~1700 in my example) residing in S3. My solution, once executed, cycles through your S3 contents and loads them into the SAP HANA Cloud Data Lake, irrespective of how many files exist.
In this piece, I would like to cover the integration with Google Cloud Platform (GCP) and specifically, Google BigQuery. I want to take this opportunity to reinforce the popularity of the Multi Cloud strategy. I firmly believe that a significant business opportunity lies in leveraging datasets that have historically been considered “external” and to a large extent, irrelevant. This could include data from social media, customers, suppliers, industry, etc. This opens up a new frontier for so many businesses.
The Cloud offers cheaper, powerful and hugely scalable data storage and processing platforms. Whilst customers have adopted the Cloud in their droves, this has invariably consisted of multiple different vendors. My previous blogs have focussed on SAP and the integration with AWS. In this blog, I will demonstrate how we can extend this integration to include GCP.
My earlier blogs catalogue the integration with AWS. Here they are again:
- SAP HANA Cloud in Conjunction with the “New” Development Approach
- SAP HANA Cloud Data Lake – Part 1
- SAP HANA Cloud Data Lake – Part 2
In the remainder of this blog, I will step you through the integration with Google BigQuery. BigQuery is a part of GCP and is Google’s serverless, highly scalable Data Warehousing solution. The BigQuery option can be seen in the list of services available within GCP.
I will now proceed to load some data….ummmm…let’s go with another Cricket dataset. T20 Cricket Internationals on this occasion. Once loaded, I will be harnessing this data along with the AWS data ingested into SAP HANA Cloud Data Lake within SAP HANA Cloud.
As a next step, I will launch BigQuery and create the necessary objects.
My project, named “am-t20s”, requires a Dataset; once the dataset has been created, I am ready to import data. In this example, I will be loading data from my Google Drive by providing the URI.
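For readers who prefer DDL over the console wizard, an external table backed by a Drive file can also be sketched in BigQuery SQL. The dataset and table names below, and the Drive file ID, are illustrative placeholders, not the objects used in this walkthrough.

```sql
-- Illustrative sketch only: an external BigQuery table over a Google Drive CSV.
-- Replace the dataset/table names and the Drive URI with your own values.
CREATE EXTERNAL TABLE `am-t20s.t20_data.t20_matches`
OPTIONS (
  format = 'CSV',
  uris   = ['https://drive.google.com/open?id=<YOUR_FILE_ID>']
);
```

Note that an external table defined this way leaves the data in Drive; BigQuery reads it at query time, which becomes relevant to the permissions issue discussed later.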
Once complete, we can query this table using the SQL Editor in BigQuery.
Since I auto-generated the schema and this file did not contain a header row, we have ended up with spurious field names! This could be tidied up by renaming the fields or, alternatively, by creating a Database View. Here is an example of how you might approach this as part of a View.
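As a sketch of that approach: when BigQuery auto-detects a headerless CSV it assigns names such as `string_field_0`, `string_field_1`, and so on, which a view can alias to something meaningful. The target column names below are illustrative assumptions, not the actual columns of my dataset.

```sql
-- Illustrative view aliasing BigQuery's auto-generated column names.
-- The aliases shown are placeholders for your real field meanings.
CREATE VIEW `am-t20s.t20_data.v_t20_matches` AS
SELECT
  string_field_0 AS match_date,
  string_field_1 AS home_team,
  string_field_2 AS away_team,
  string_field_3 AS winner
FROM `am-t20s.t20_data.t20_matches`;
```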
This completes the BigQuery data preparation steps.
We will now move on to the integration with SAP HANA Cloud. This comprises the following steps:
- Create a Service Account in Google BigQuery
- This account requires Read access to the objects created in the preceding steps
- A Private Key is required to support communications via the BigQuery REST API. This is generated in JSON format
- Download and import the Google Certificate into the SAP HANA Cloud Certificate Store to create HTTP Trust
- Create Remote Source in SAP HANA Cloud
Once these steps have been performed, we are ready to access data hosted in GCP via Smart Data Access in SAP HANA Cloud. This uses the BigQuery REST API service to facilitate communications.
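The certificate import and remote source creation can be sketched in SQL along the following lines. Treat this strictly as an outline: the PSE name, certificate body, project details and credentials are placeholders, and the exact adapter name and configuration options should be verified against the SAP HANA Cloud documentation for your release.

```sql
-- Sketch only: establish HTTPS trust, then create the BigQuery remote source.
-- Certificate body, names and credentials below are placeholders.
CREATE CERTIFICATE GOOGLE_CERT
  FROM '-----BEGIN CERTIFICATE-----...-----END CERTIFICATE-----';
CREATE PSE HTTPS_PSE;
ALTER PSE HTTPS_PSE ADD CERTIFICATE GOOGLE_CERT;
SET PSE HTTPS_PSE PURPOSE REMOTE SOURCE;

-- Adapter/option names to be confirmed against SAP documentation.
CREATE REMOTE SOURCE BIGQUERY_SRC ADAPTER "bigquery"
  CONFIGURATION 'server=bigquery.googleapis.com'
  WITH CREDENTIAL TYPE 'OAUTH'
  USING 'user=<service-account-email>;password=<private-key-from-JSON>';
```

The service account email and private key come from the JSON key file generated in the earlier step.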
I have introduced and explained the use of Virtual Tables in my previous blog.
Whilst this allowed me to create the new Virtual Table, I was unable to query it due to a lack of privileges. I traced this back to my Google Drive and discovered that an “External Table” in BigQuery does not contain the data; instead, it retrieves it from the source, i.e. Google Drive. My Service Account did not have these permissions. After correcting this, I was able to query my data in SAP HANA Cloud.
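Creating and querying the Virtual Table follows the pattern below. The remote source, schema and object names are illustrative placeholders rather than the exact objects in my system.

```sql
-- Sketch: expose a BigQuery object as a HANA virtual table, then query it.
-- Path is remote source, project, dataset, object; names are placeholders.
CREATE VIRTUAL TABLE MYSCHEMA.VT_T20_MATCHES
  AT "BIGQUERY_SRC"."am-t20s"."t20_data"."v_t20_matches";

SELECT * FROM MYSCHEMA.VT_T20_MATCHES LIMIT 10;
```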
This Virtual Table in SAP HANA Cloud is retrieving datasets that are hosted in Google BigQuery. Finally, I will create a View that combines this data with data that resides in the SAP HANA Cloud Data Lake and has been retrieved from AWS. The AWS integration has been documented in my previous blog.
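The shape of such a combining view might look like the following. The column list, the Data Lake virtual table name and the literal labels are assumptions for illustration; the real objects are those created in this and the previous blog.

```sql
-- Sketch: combine the BigQuery virtual table with the Data Lake table
-- loaded from AWS S3 (all names and columns here are illustrative).
CREATE VIEW MYSCHEMA.V_CRICKET_COMBINED AS
SELECT match_date, home_team, away_team, winner, 'T20I (BigQuery)' AS source
  FROM MYSCHEMA.VT_T20_MATCHES
UNION ALL
SELECT match_date, home_team, away_team, winner, 'Data Lake (AWS)' AS source
  FROM MYSCHEMA.VT_MATCHES_DATALAKE;
```

A UNION ALL is shown here, but a join would work equally well where the two datasets share keys; HANA federates the BigQuery portion of the query via the remote source.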
In summary, this blog provides a practical approach to a Multi Cloud strategy. This scenario demonstrates how SAP HANA Cloud can interface with different Hyperscalers and allow organisations to federate, transform and combine large and complex datasets to deliver valuable insights. Once these integration points are in place, we can leverage the strength of the platform to analyse, predict and model this data to help anticipate business outcomes. More on these modern Data Warehousing principles in future blogs.
As always, please get in touch if there are any questions or insights that you wanted to share. Can I also take this opportunity to wish you all a very safe and pleasant Christmas break.