Einstein Analytics ELT Orchestration in Node-Red

Fellow Trailblazers,

In this blog, I focus on how we can orchestrate EA ELT jobs (dataflows & data jobs) via the EA APIs. My aim is to stay on the declarative side as much as possible while executing this use case. Feel free to try this out and let me know your feedback.

Background :

Making the right decisions based on perfect insights isn’t always simple. At times it can be tricky just to get the data you need from a variety of different data sources on time for your EA assets. Data overload is a reality that many enterprises are currently struggling with. The need and want for more and more data can get in the way of why the data is important in the first place: providing clear, reliable, and actionable insights to make important business decisions.

Scheduling multiple dataflow jobs for an EA implementation can be tricky if you don’t know how long each ELT job takes. We have a defined governor limit on concurrent dataflow runs per instance, which makes it a real challenge to serve the needs of every end user from different verticals, and even more so when you are dealing with complex, high-volume processes in your implementation.

This is where dataflow orchestration comes in. It allows you to control the sequence of execution and to make the most of dataflow modularisation and reuse. Currently, the platform itself offers limited functionality for orchestration; the good news is that we can use the EA APIs and external tools to orchestrate our dataflows. It is also always good practice to avoid multiple extracts of the same information within an EA dataflow when there is no functional difference.

In this article, I would like to focus on how we can tackle those edge use cases by leveraging the EA APIs combined with other external applications to build these orchestrations. I will further extend this use case with my data cleansing process, invoking the data APIs within the same process framework in order to optimize my EA datasets. This should be an interesting topic for EA developers, EA architects, and those in advisory roles in the Salesforce ecosystem.

EA ELT Orchestration: What Is It, and Why Is It Important

Orchestration is the term we usually mean when we refer to automating many things at once. It chains together multiple automated tasks in order to execute a larger workflow or process.

Well-optimized extract, load, and transform (ELT) job operations form the baseline of any successful Einstein Analytics implementation. They transform operational and raw data into useful datasets and, ultimately, into actionable insights. As a result, valuable data insights are more frequently available to our end users for self-exploration, or for further slicing and dicing to fulfill their reporting needs.

EA ELT job scheduling & automation is another area that is evolving rapidly in the product. If we look at the last few releases, there have been massive enhancements in this area. More time frequencies and granularities have been added for running our dataflow jobs (time-based). Additionally, we can add event-based dataflow jobs that trigger our data sync jobs prior to every run. This helps us keep our EA apps (datasets/dashboards/lenses) up to date. But there are still gaps in the product, such as when you want to orchestrate a dataflow execution based upon another dataflow job’s outcome, or to have better control over ELT job execution.

What is Node-Red :

It is a flow-based development tool for visual programming and for wiring together hardware devices, APIs and online services as part of the Internet of Things. Node is in the name purely because it is built on top of Node.js and takes advantage of the huge node module ecosystem to provide a tool that is capable of integrating many different systems. Feel free to read more -> Link 

My Use-case :

We have Sales and Service EA applications relying on the same kinds of information, such as account, contact, and user data, that already require some processing; the users’ needs cannot be met with data sync alone. The order in which those interdependent business processes are sequenced and calculated to feed data into our datasets is also essential.

For small-volume use cases, we might get by with scheduling more frequent dataflow job runs to cater to our needs within a single day. But this gets much more challenging when individual dataflows and sub-dataflows ingest and transform hundreds of millions of rows simultaneously on the instance, and their dataset outcomes depend on one another.

Key Questions :

  • Dataflow Orchestration: What if you have a resulting dataset from one dataflow that is essential to provide inputs to your second dataflow job, and both are high-volume jobs on your EA instance? How would you handle the order of execution in your scheduling to avoid data conflicts and incomplete results?
  • Data Cleansing Process: Every resulting dataset stored in our EA instance counts against our overall EA global storage utilization. As EA does not have any ‘staging’ area, you might have raw datasets floating around that are of no use once your application’s datasets have been transformed. How would you incorporate this kind of data cleansing process along with your dataflow orchestration?

I am going to touch upon the above scenarios in this blog, leveraging the EA dataflow & data APIs in a Node-RED application. I am hosting my Node-RED application on the Heroku platform, but if you like you can also install Node-RED locally, or choose any platform to host your app and achieve the same outcome.

Don’t forget to join EA Success Community :

I highly recommend joining our EA Camp series or the EA success community to learn more about EA integration best practices from our product evangelists and COE team members. Additionally, you can always share your great ideas and current use cases on this forum.

And before we deep dive, I would also like to thank my dear colleagues from the Salesforce Ohana, Wouter Trumpie & Rodolphe Lezennec, for sharing their great innovative insights to shape this POC.

Let’s Deep Dive –

1) Use-case Key Components : 

Below are the key components I have used for my use case:

  • Node-Red Application  
  • Heroku Platform
  • Salesforce Dev Instance

Note – You can use the pre-built wrapper for deploying Node-RED into Heroku – Link

2) Scheduling Orchestration on Node-Red using EA DataFlow APIs :

– Demo Video :

Description: In the above demo, I invoke the EA dataflow APIs and sequence my jobs (My EA Dataflow -> 1 & My EA Dataflow -> 2) one after the other, based upon the job status of each dataflow job. This can easily be extended with more nested flows based upon your rules, and can be equipped with more customized exception handling if needed for your use case.
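Outside Node-RED, the same sequencing logic can be sketched in a few lines of Python. This is only an illustration of the control flow: `start_job` and `get_status` are placeholders for the EA dataflow API call-outs (passed in as callables, so the sequencing logic stays independent of the HTTP details), not real client functions.

```python
import time

# Terminal job states as reported by the EA dataflow jobs API.
SUCCESS, FAILURE = "Success", "Failure"

def wait_for_job(get_status, job_id, poll_seconds=0, max_polls=100):
    """Poll a dataflow job until it reaches a terminal state."""
    for _ in range(max_polls):
        status = get_status(job_id)
        if status in (SUCCESS, FAILURE):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish")

def run_in_sequence(start_job, get_status, dataflow_ids):
    """Start each dataflow only after the previous one succeeded."""
    results = []
    for dataflow_id in dataflow_ids:
        job_id = start_job(dataflow_id)
        status = wait_for_job(get_status, job_id)
        results.append((dataflow_id, status))
        if status != SUCCESS:
            break  # stop the chain on failure, like the demo flow does
    return results
```

In the Node-RED flow, the same loop is expressed visually: an HTTP-request node starts the job, a delay node and a function node poll the status, and a switch node decides whether to trigger the next dataflow.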

– Architecture Overview : 

Below is the dataflow API architecture overview for the above demo – 

3) Dataflow & Data Cleansing Orchestration on Node-Red using EA APIs :

–  Demo Video :

Description: In the above demo, I invoke the EA dataflow & data APIs in my process sequence. At the end of my process flow, I delete the rows from my first raw registered dataset (ext_raw_dataset_1) on my instance. This can also easily be extended with more nested flows based upon your rules, and can be equipped with more customized exception handling if needed for your use case.

– Architecture Overview : 

Below is the EA Data API architecture overview part for the above demo.  

In the second part of my use case, the dataset cleansing process, I am leveraging the EA external data APIs to perform overwrite and delete operations. Below are the relevant steps associated with this process and some key terms. To deep dive into the data APIs, feel free to explore the public documentation – Link

  • Step 1 → I overwrite my first external dataset, which resulted from my first dataflow, with one dummy row.
  • Step 2 → In the next step, I initiate a delete operation to delete the dummy row from that dataset, cleaning up its rows.

a) The InsightsExternalData Object: This object provides a “header” that contains information about my upload, such as the name of my dataset, the format of my data (Ex: CSV), and the operation to perform on the data. I am also providing the metadata file. It is used with the InsightsExternalDataPart object, which holds the parts of my data to be uploaded.

b) The InsightsExternalDataPart Object: This object works with the InsightsExternalData object. After you insert a row into the InsightsExternalData object, you can create part objects to split your data into parts. In my case, I am only uploading my dummy data file, base64 encoded, to overwrite my current data with one row, then following the same steps to delete my data. What is important for the deletion process is that your data has a unique identifier.
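To make the part-splitting concrete, here is a small sketch (my own helper, not part of any SDK) that chunks a CSV payload into base64-encoded `DataFile` values, one per `InsightsExternalDataPart` row, using the documented 10 MB cap per part as the default chunk size.

```python
import base64

MAX_PART_BYTES = 10 * 1024 * 1024  # documented cap per InsightsExternalDataPart

def to_data_parts(csv_bytes, max_part_bytes=MAX_PART_BYTES):
    """Split raw CSV bytes into base64-encoded part records, numbered from 1."""
    parts = []
    for number, offset in enumerate(range(0, len(csv_bytes), max_part_bytes), start=1):
        chunk = csv_bytes[offset:offset + max_part_bytes]
        parts.append({
            "PartNumber": number,
            "DataFile": base64.b64encode(chunk).decode("ascii"),
        })
    return parts
```

For my one-row dummy file this always yields a single part, but the same helper covers the high-volume case where a large CSV must be spread over several part records before the upload is processed.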

c) Metadata JSON File: A metadata file is a JSON file that describes the structure of my external data. This is where we define the schema of our data.
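As an illustration, a minimal metadata file for a two-column dummy dataset could look like the sketch below (built in Python for readability; the dataset and field names are my assumptions). Note the `isUniqueId` flag on the identifier field, which is what later enables the delete operation.

```python
import json

# Illustrative External Data Format metadata; names are assumptions for this sketch.
metadata = {
    "fileFormat": {
        "charsetName": "UTF-8",
        "fieldsDelimitedBy": ",",
        "numberOfLinesToIgnore": 1,  # skip the CSV header row
    },
    "objects": [{
        "connector": "CSV",
        "fullyQualifiedName": "ext_raw_dataset_1",
        "label": "ext_raw_dataset_1",
        "name": "ext_raw_dataset_1",
        "fields": [
            {"fullyQualifiedName": "Id", "name": "Id", "label": "Id",
             "type": "Text", "isUniqueId": True},  # required for the Delete operation
            {"fullyQualifiedName": "Name", "name": "Name", "label": "Name",
             "type": "Text"},
        ],
    }],
}

metadata_json = json.dumps(metadata, indent=2)
```

This JSON string is what gets base64-encoded into the `MetadataJson` field of the InsightsExternalData header row.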

d) Edgemart Container: The app (e.g. SharedApp) that owns/keeps the dataset.

4) Implementation Key Steps : 

  • Step 1 → Configure Your Connected App with OAuth Authorization Settings. To Learn more Link
  • Step 2 → Sign up for free Heroku Platform Account. To Learn more Link
  • Step 3 → Deploy the pre-configured Node-Red wrapper on Heroku. To Learn More Link  
  • Step 4 → Set up OAuth 2.0 Authentication flow in Node-Red. To Learn More about this node Link
  • Step 5 → Invoke the EA REST dataflow APIs to initiate, sequence & monitor your jobs. To Learn more Link
  • Step 6 → Invoke the EA external data APIs to perform operations on your datasets (in my use case: overwrite & delete). To Learn more Link

 (Note: To learn more about Node-RED, with a few sample tutorials and cookbooks – Link)
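As a sketch of Step 4, the OAuth 2.0 username-password (password grant) token request that the Node-RED OAuth node performs can be reproduced as a plain form-encoded POST body. All credential values below are placeholders for your own connected app settings.

```python
from urllib.parse import urlencode

# POST target for the token request (use test.salesforce.com for sandboxes).
TOKEN_URL = "https://login.salesforce.com/services/oauth2/token"

def password_grant_body(client_id, client_secret, username, password):
    """Form-encoded body for the OAuth 2.0 username-password token request."""
    return urlencode({
        "grant_type": "password",
        "client_id": client_id,          # connected app consumer key
        "client_secret": client_secret,  # connected app consumer secret
        "username": username,
        "password": password,            # append the security token if required
    })
```

POSTing this body to `TOKEN_URL` returns a JSON payload whose `access_token` is then carried in the `Authorization: Bearer` header of every subsequent EA API call-out in the flow.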

5) Implementation Process Flow Overview (Modularisation) :  

Don’t reinvent the wheel – when we have a common function, it is always good practice to create a single workflow “class” that can be “instantiated” as frequently as required, yet maintained only once. Instead of creating multiple versions of a service, use modules, variables, or parameters that can accommodate the variety.

To reduce the complexity of my process flow, I have modularised my core business logic into several sub-flows and bound them to individual service modules. By doing so I am able to:

  • Manage & maintain my process flow in a more convenient manner
  • Easily reuse any sub-flow and service block in my other use cases
  • Maintain quality and extend my blocks’ features faster
  • Debug faster in case any odd exception cases occur
  • Most importantly, let anyone from my team deep dive and collaborate faster

6) API Explorer :

There are several tools that help you explore various APIs interactively, see the methods available for each API and the parameters they support, along with inline documentation. Below are some of the tools you could use to test your EA API endpoints –

  • Postman – (Link): Postman is a collaboration platform for API development. Its features simplify each step of building an API and streamline collaboration so you can create better APIs, faster.
  • Salesforce Workbench -(Link): A web-based suite of tools designed for administrators and developers to interact with Salesforce.com organizations via the Force.com APIs.

7) EA REST APIs End Points :

Below are the relevant API endpoints I will be using to build this use case. 

  • /services/data/v46.0/wave/dataflowjobs → Start your Dataflow
  • /services/data/v46.0/wave/dataflowjobs/jobId → Retrieve your dataflow job & its status
  • /services/data/v46.0/sobjects/InsightsExternalData → Dataset Upload/Operation Information
  • /services/data/v46.0/sobjects/InsightsExternalDatapart → Upload your Datapart
  • /services/data/v46.0/sobjects/InsightsExternalData/DataID → Initiate your Upload
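Putting those endpoints together for the cleansing flow, a minimal sketch of the header record and request paths could look like this. The field names and values come from the InsightsExternalData object; the helper functions themselves are mine, for illustration only.

```python
API_VERSION = "v46.0"

def insights_header(dataset_alias, metadata_b64, operation="Overwrite"):
    """Header record inserted into InsightsExternalData to begin an upload."""
    assert operation in ("Append", "Delete", "Overwrite", "Upsert")
    return {
        "Format": "Csv",
        "EdgemartAlias": dataset_alias,
        "MetadataJson": metadata_b64,   # base64-encoded metadata JSON file
        "Operation": operation,
        "Action": "None",  # flipped to "Process" via PATCH once all parts are uploaded
    }

def endpoint(sobject, record_id=None):
    """Build the REST path for the data API sObjects listed above."""
    path = f"/services/data/{API_VERSION}/sobjects/{sobject}"
    return f"{path}/{record_id}" if record_id else path
```

In my use case the flow first POSTs an `Overwrite` header plus one dummy data part, then repeats the same cycle with a `Delete` header to remove that row again.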

8) API Explore via Salesforce Workbench REST explorer : 

It is essential to review our EA REST call-outs and their responses beforehand, to make sure they comply with the Salesforce-defined API contract and platform governor limits. Below are some sample Workbench API snapshots relevant to our use case.

(Disclaimer: Workbench is free to use, but is not an official salesforce.com product. Workbench has not been officially tested or documented, and salesforce.com support is not available for Workbench. Please use it at your discretion.)

A) – Start your Dataflow

        /services/data/v46.0/wave/dataflowjobs   (with your dataflow_id in the request body)

        Method – POST

B) – Retrieve your Dataflowjobs (Status)

       /services/data/v46.0/wave/dataflowjobs/yourjob_id

       Method – GET

C) Configure our Dataset          

    /services/data/v46.0/sobjects/InsightsExternalData

    Method – POST

D) Data Chunking & Part Upload

   /services/data/v46.0/sobjects/InsightsExternalDataPart

   Method – POST

E) Start the Upload

   /services/data/v46.0/sobjects/InsightsExternalData/id_of_InsightsExternal 

   Method – PATCH

F) Monitor Your Upload

   /services/data/v46.0/sobjects/InsightsExternalData/id_of_InsightsExternal 

   Method – GET

9) Other Key Considerations :

  • Your dataset metadata schema should be consistent. Keep a safe copy of it when you create your dataset for the first time.
  • Delete Operation: The rows to delete must contain one (and only one) field with a unique identifier. Make sure to flag the uniqueness in your schema file for that field.
  • The metadata JSON & CSV data file values should be base64-encoded strings. You can use this free tool to encode to base64 – Link
  • Make sure your design will respect the defined EA & SFDC API governor limits. To learn more – Link 

10) Conclusion :

  • In this article, I have covered two fairly generic scenarios: first, orchestrating my dataflows via the EA dataflow APIs; and second, invoking the EA data APIs in the same process flow to cleanse my non-usable datasets.
  • I have used a fairly straightforward sample use case to depict the power of the EA APIs, and this can easily be extended for your specific use cases if needed.
  • In use cases where you have clarity on the timing of data updates and external uploads, taking further control over ELT job execution can be another value-added way to streamline your EA processes and scheduling strategies (peak vs. off-peak hours) for your implementation.
  • Lastly, you can use any platform or application as a canvas to replicate the same EA orchestrations for your use case, leveraging the EA platform APIs.

For those new to this topic, I hope this gave you a fair idea about ELT job orchestration via the EA APIs on the platform. Let me know your feedback and comments. I hope this helps.

Cheers!
Varun

