SAP HANA Cloud Data Lake – Part 1
In my last blog, I shared my initial impressions of SAP HANA Cloud and catalogued my adoption of the HDI development paradigm, setting out the new tooling and capabilities of the platform. I signed off by promising a similar appraisal of the Data Lake capabilities. With a day left before I break for some holiday, I wanted to make good on this promise.
This blog aims to share my experiences of SAP HANA Cloud Data Lake and what I have learned along the way. I’ve been working on a fairly extensive and exciting scenario that I thought would be helpful to split across a few blogs. In this piece, I will introduce you to SAP HANA Cloud Data Lake and the scenario I have concocted. I will build upon this in subsequent posts.
SAP HANA Cloud Data Lake is currently available as an optional add-on for SAP HANA Cloud and has been Generally Available for a few months. It was commonly referred to as SAP HANA Cloud Relational Data Lake around the time of its inception, but I’ve noticed that the term “Relational” has since been dropped, or is at least far less commonly encountered. As I will come on to explain, the solution is based on a relational storage platform but it can also hold non-relational data, e.g. BLOBs. When provisioning the SAP HANA Cloud system, there is the option to include a Data Lake instance; alternatively, this can be activated at a later date. You can see my SAP HANA Cloud and Data Lake systems in the screenshot below.
SAP HANA Cloud Data Lake is founded on SAP IQ (formerly Sybase IQ) technology, and specifically its Multiplex deployment. IQ is renowned for scale, distribution and performance, and this is palpable in the use of its latest manifestation. In practice, you will need to determine an appropriate Data Tiering strategy that stores and processes the data to meet your requirements and, specifically, your query response requirements.
As with previous versions of SAP HANA, you are able to vary storage between In-Memory and the Native Storage Extension (NSE). Additionally, we now have the option to store multiple Petabytes of data in the Data Lake. All of this is accessed uniformly from SAP HANA using SQL commands. An overview of these operations and the use of Database Explorer is covered in my previous blog.
The SAP HANA Cloud Data Lake system can be expanded and contracted based on your requirements and will be charged accordingly. This provides flexibility in how you respond to peaks and troughs in demand.
In my quest to examine this capability further, I stumbled upon a treasure trove of Cricket statistics. Pure happenstance…
I opted for data relating to One Day Internationals (ODIs) and, in totality, I have a CSV file for every ODI that has ever been recorded. This amounts to approximately 1,700 files, and each file records every delivery that was bowled in that match, e.g.
1,18.4,Australia,RT Ponting,DR Martyn,JJ van der Wath,4,0,"",""
In plain English:
- First innings, delivery 18.4 (the fourth ball of the nineteenth over)
- Batting team is Australia
- Batsman is Ponting (the magnificent Ponting, I feel compelled to add)
- Non-striker is Martyn
- Bowler is van der Wath
- Runs off the bat: 4 (elegantly driven through the covers, I suspect)
- Extras: 0
- No wicket
- No dismissed player
A lot of data…
In the ensuing sections, I will take you through the process of loading this data into SAP HANA Cloud Data Lake and once it has been successfully loaded, we will carry out some further processing and querying.
SAP recommends the use of Cloud Hyperscalers to perform high-speed loads into SAP HANA Cloud Data Lake.
In this example, I have opted to use AWS S3 Buckets to host this data. After creating a Bucket, I created a subfolder named “ODI” and this was the new home for our data.
I initially opted for the GUI upload but experienced significant hardships. The upload was taking an inordinately long time and repeatedly timing out before it completed. After some further research, I became aware of the AWS Command Line Interface (CLI), which also covers S3. I downloaded the CLI executable and beavered away in Windows PowerShell. The upload was unbelievably fast!
The CLI also supports a whole range of S3 administrative operations, e.g. moving files, renaming, etc. Many of these commands are invoked by the code I will take you through in a subsequent post. I would advise the use of this CLI if you are looking to load a substantial number of files or volume of data. Once the upload was complete, I needed to make some further changes in S3 to provision external access and, crucially, generate the Access Key and Secret Access Key. These are required in the next step.
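For reference, here is a minimal sketch of the CLI upload. The local folder, bucket name and region are placeholders of my own, not the exact names I used:

```shell
# Upload the ~1,700 ODI CSV files to the bucket's ODI folder in one pass.
# "odi-csvs" and "my-cricket-data" are placeholder names.
aws s3 sync ./odi-csvs s3://my-cricket-data/ODI/ --region eu-west-2

# Confirm the object count and total size after the upload.
aws s3 ls s3://my-cricket-data/ODI/ --recursive --summarize
```

The `sync` command parallelises transfers and skips files that are already present, which is why it so comfortably outperforms the browser upload.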
We are now ready to load this data into our SAP HANA Cloud Data Lake instance.
Firstly, let me take you through some key integration points across SAP HANA Cloud and the Data Lake instance. The following image displays the default connection to the Data Lake, “SYSRDL#CG_SOURCE”. “RDL” is presumably a remnant of the “Relational Data Lake” name that it had first adopted…
This features as a predefined Remote Source.
A closer examination of this remote source also reveals the IQ system providing this functionality. Note the IQ driver and connection parameters.
Running commands on your Data Lake instance requires code to be submitted through the Procedure named “SYSRDL#CG.REMOTE_EXECUTE”. The submitted code must use the IQ SQL dialect. This caught me out a few times, but Sybase IQ documentation is still widely available on the Internet.
Let’s put this to practice by creating a physical Table in the Data Lake.
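As a sketch of what this looks like, the following wraps an IQ CREATE TABLE statement in the REMOTE_EXECUTE procedure. The table and column names are my own illustrative choices, derived from the CSV structure described earlier:

```sql
CALL SYSRDL#CG.REMOTE_EXECUTE('
    CREATE TABLE ODI_DELIVERIES (
        INNINGS          INT,
        DELIVERY         DECIMAL(4,1),   -- over.ball, e.g. 18.4
        BATTING_TEAM     VARCHAR(50),
        BATSMAN          VARCHAR(100),
        NON_STRIKER      VARCHAR(100),
        BOWLER           VARCHAR(100),
        RUNS             INT,            -- runs off the bat
        EXTRAS           INT,
        WICKET_KIND      VARCHAR(50),
        PLAYER_DISMISSED VARCHAR(100)
    );
');
```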
Note that it’s also possible to modify Tables and to create Views and Indexes using the above approach. You may want to do this to optimise data retrieval times.
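For example, an IQ High Group (HG) index on the batsman column, again submitted through REMOTE_EXECUTE, could speed up the grouping and filtering carried out later. The index name here is illustrative:

```sql
CALL SYSRDL#CG.REMOTE_EXECUTE('
    CREATE HG INDEX IDX_ODI_BATSMAN ON ODI_DELIVERIES (BATSMAN);
');
```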
Once the Data Lake Table has been created, we are ready to populate this with data held in our CSV files. In the following example, I will use the optimised LOAD TABLE statement to ingest some data from the aforementioned S3 Bucket.
Note that the Access Key and Secret Access Key generated in AWS S3 must be supplied here.
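A hedged sketch of that LOAD TABLE call, wrapped in REMOTE_EXECUTE, is shown below. The bucket, file name, region and credentials are placeholders, and note the doubled single quotes required inside the procedure’s string argument:

```sql
CALL SYSRDL#CG.REMOTE_EXECUTE('
    LOAD TABLE ODI_DELIVERIES (
        INNINGS, DELIVERY, BATTING_TEAM, BATSMAN, NON_STRIKER,
        BOWLER, RUNS, EXTRAS, WICKET_KIND, PLAYER_DISMISSED
    )
    USING FILE ''s3://my-cricket-data/ODI/match_0001.csv''
    DELIMITED BY '',''
    FORMAT CSV
    ACCESS_KEY_ID ''<your-access-key>''
    SECRET_ACCESS_KEY ''<your-secret-access-key>''
    REGION ''eu-west-2''
    ESCAPES OFF
    QUOTES ON;
');
```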
As a final step, we must now create a Virtual Table in SAP HANA to access this data. There is a GUI for this operation or, alternatively, it can be generated through SQL commands. This Virtual Table supports bidirectional access: you can read from the Data Lake, as per the example below, but you can also write to the Data Lake via this Table.
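In SQL terms, the Virtual Table creation and a first query look roughly like the following. The virtual table name, schema and query are my own illustrations, with the remote location following the pattern of the predefined SYSRDL#CG_SOURCE remote source:

```sql
-- Create a Virtual Table in SAP HANA Cloud pointing at the Data Lake
-- table. "MY_SCHEMA" is a placeholder for your own schema.
CREATE VIRTUAL TABLE "MY_SCHEMA"."V_ODI_DELIVERIES"
    AT "SYSRDL#CG_SOURCE"."<NULL>"."SYSRDL#CG"."ODI_DELIVERIES";

-- Read through the Virtual Table, e.g. career ODI runs per batsman.
SELECT BATSMAN, SUM(RUNS) AS RUNS_OFF_THE_BAT
FROM "MY_SCHEMA"."V_ODI_DELIVERIES"
GROUP BY BATSMAN
ORDER BY RUNS_OFF_THE_BAT DESC;
```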
We can now combine this with other datasets, incorporate into a HDI Project or create Calculation Views before we consume the resultant data in SAP Analytics Cloud. My previous blog describes these steps.
In closing, I hope this helps you appreciate the architectural underpinning of SAP HANA Cloud Data Lake and understand its relationship with SAP HANA Cloud. I’ve shared some best practices around loading large volumes of data through leveraging AWS S3 and its CLI. You can follow the above steps to load and work with any of your own datasets.
In my next post, I want to share how we can recursively load files (around 1,700 in my case) using a Stored Procedure I have developed. Recursive loading isn’t supported by the LOAD TABLE statement and, despite many hours of searching, I was unable to find any existing solutions. I look forward to finalising my code and sharing it with you over the coming days.
As always, please get in touch if there are any questions or insights that you wanted to share.
If you haven’t signed up for #itelliFest2020 On Demand, you can do so here.