Einstein Analytics Data Acquisition & Heroku
In this blog, I am focusing one of the key area “Data Acquisition” on Einstein Analytics. For the demo purpose, I have decided to use Heroku platform (Heroku Connect & Heroku Postgres (SQL) database) which is tightly integrated with Salesforce . Also, my focus is ‘Point-and-Click’ actions without writing a single line of CODE how you can perform these actions precisely and efficiently. Feel free to try this out as we have free dev instances from Heroku available for everyone.
Einstein Analytics Data Acquisition:
Data Acquisition is one of the core factors to deliver successful Einstein Analytics projects. As everything is about connecting the dots in a fourth industrial revolution. Also, you don’t want to put fancy dashboards which are taking several minutes to load and filter. Therefore not only understanding the business needs or satisfying the product owners, but equally understanding the data, modeling, schema design & data preparation is well needed beforehand for the successful EA rollouts.
I am well emphasizing this at the very top as I had my nightmares on this topic with few implementations and have my lessons learned. Also if you want to dig deeper into these topics I would highly recommend following EA camp series. Check few sessions from Tim Bezold & Terence Wilson (EA COE-Salesforce).
“It’s difficult to imagine the power that you’re going to have when so many different sorts of data are available: – Tim Berners-Lee”
Key Considerations & Lessons Learned:
a) Focus to bring as much clean data into EA using these connectors. Do not throw heavy lifting tasks on data flows and recipes only to transform your data. Purpose of these tools is to perform small data string operations to help you prepare your data on smaller data volume. NOT when you are dealing with millions of heavy loads. Therefore, do not position this utility to construct the core data warehouse business logic or to replace any ETL products. Early alignment & engagement with data Integration team within your organization is very essential to have the right ETL strategy in place for your EA rollouts.
b) Keep your business logic as much close to the data source. By doing that you have full control on your data inflows and can debug faster in case of any conflicts. In addition, you have enough bandwidth to optimize further in your data flow and SAQL for your final datasets and dashboards.
c) Personally, I am not a big fan of data recipes. Imagine you are dealing with few hundreds of millions of rows. First, how much time do you think these operations will take to transform every single row based on your data operation? Then you really need to align your data flow and replication scheduling strategy based upon this transformed dataset, then later you have these datasets ready for you to ingest further in the final datasets for your dashboards.
Remember any overlap in scheduling might skim the data to the final dataset source and dashboards. And keep that in mind resulting dataset cannot effectively mutate itself to become optimized it has to be driven by the data supplied and it takes time to settle in order to see your SAQL queries runs faster on the platform. So having right scheduling strategy in place is one of the key factors for successful EA rollouts. I would recommend scheduling your flows (Replication & Data flows) in non-business hours in order to have your performance oriented EA dashboards for the end users. By the time all resulting digest nodes will be well settled for your end users to provide performant looking dashboards .
d) This is a more commercial aspect but of course very essential in regards to data storage when you are dealing with heavy loads. Let’s take an example you have 5 million rows in your outside environment and you are pushing via ETL into Einstein Analytics from heterogeneous sources. Then you start your data operations using recipes then create a new ‘Dataset’ with 5.1 million rows. Then you may or may not use data flows if you are then you will register these 5.1 million rows again with more data 5.2 million rows by adding SFDC attributes to your rows. So in the end, you will be consuming :
- 5 million Rows: Initial External Load – Dataset 1
- Data source 1 ->Transform-> 1 million Rows: Post recipes resulting Dataset 2
- Data source 2 ->Transform-> 2 million Rows: Post data flow resulting Dataset 3
Total Storage: Dataset 1 (5) + Dataset 2 (5.1) + Dataset 3 (5.2) —-> 15.3 million rows
Dataset 2 can be clear saving if you have the right ETL strategy in place. Not only the storage but also this will add on to the performance aspect of your dashboards to do less heavy lifting within EA and control your data quality and conflicts close to the source itself. Therefore, think ahead before designing your data acquisitions strategy for EA implementations. Compare the features best fit to your needs.
Einstein Analytics & Heroku:
Below are the key components I am using for my use case :
- Salesforce Single Org – (Could be Single or Multi Orgs)
- Heroku Connect
- Heroku Postgres SQL Database
- Einstein Analytics (Reporting Org) – (Could be Same or any other standalone Reporting Org)
Einstein Analytics Data Connectors:
Analytics connectors is an easy way to integrate the data inside and outside of Salesforce org and introduce it into Einstein Analytics. It’s like a plug which you can establish a connection from your data source. Then use that connection to pull the relevant data or objects from that source into EA. Then you can prepare the data in these replications or dataset recipe or in the data flow to create your final resulting dataset.
Below is the overview of existing Einstein Analytics connectors available in the environment. I am focusing on Heroku Postgres connector for this blog (Snapshot Below) :
Below is the Architecture design overview of the integration and system boundaries used in this scenario:
- Step 1: Pulling the Opportunity data from Salesforce via Heroku Connect. (Polling Frequency: Every 10 mins)
- Step 2: Pushing the Opportunity data into Postgres SQL database
- Step 3: Pulling the data out from Heroku Postgres via EA Heroku Connector in the replication cache (Frequency: Every Hour)
- Step 4: Pulling the data out from EA Replication via EA Data flows to register my final Heroku based dataset for my dashboards. (Frequency Every 4 hours)
1) Heroku Sign up: You need to sign up for the free Heroku account to test this use case. Set your role as a hobbyist and select any preferred development language for the sign-up. Signup Link
2) Create New Heroku App: for this use case, I am creating a simple Heroku app. Then later assign this app to my new Heroku database via Heroku Connect.
Make sure to have lowercases for your app name. In my case, I am creating an app iam-trailblazer for the demo purpose (Snapshot Below).
3) Heroku Connect Installation: In this step, we will add Heroku connect add-on and associate this add-on to your app in my case “iam-trailblazer”. Then Go to the overview page tab of your Heroku application and then press ‘Configure Add-ons’ at the very right corner of the application. (Snapshot Below)
Then you will be re-directed to the new ‘Resource’ tab. Then you need to press the ‘Find More add-ons’ button.
Then select Heroku connect from the selection for your app first add-on and install the application. (snapshot below) :
In the next step, you have to provision this add-on to your app. In my case iamtrailblazer and select the plan, you have opted for and press the ‘provision add-on button’ (snapshot below).
4) Heroku Postgres SQL Database Configuration: Follow the same step above in order to find the Heroku Postgres SQL. Then use this add-on for your app database provisioning.
Then select Heroku Postgres from the selection for your app first add-on and install the application (snapshot below).
In the next step, you have to provision this database add-on to your app. In my case iamtrailblazer and select the plan, you have opted for or hobby dev free version and press the ‘provision add-on button’ (snapshot below).
Now after these steps, you have already configured the above two add-ons and assigned those to your app (snapshot below).
5) Setting up Connection (Heroku – Salesforce): In order to pull the data from Salesforce instance, we have to establish the connection between Heroku Connect & PostgreSQL & Salesforce.
Press ‘Heroku Connect’ add-on from the resources tab. Then press ‘Setup Connection’ (snapshot below).
Then Select the Postgres database and enter the schema name for storing your data and press NEXT. Please save this “Schema” name somewhere on the notepad as you might need later for your connection establishment within Einstein Analytics (snapshot below).
Now we have to authorize and link the connection between the app and Salesforce data. In the next step, select the Salesforce environment on which you want your data to be pulled out. And select production in case of free developer accounts you are testing (Snapshot below).
Then press ‘Allow’ to give Heroku application access to your Salesforce application in order to pull the data from the target salesforce environment (snapshot below).
6) Heroku Connect Salesforce Mapping: In this step, we will create our first ‘Opportunity Object’ mapping for our Salesforce Object in Heroku from the ‘Mappings’ tab (Snapshot below).
Then select the Salesforce Object you wish to map. In our case, I am taking ‘Opportunity’ object for your mapping (Snapshot below).
Then select the ‘Desired’ fields you want to synchronize from the mapping table. Leave the poll frequency to 10 minutes (default) and SAVE your mapping (Snapshot Below).
Based on the mapping data will be pulled out automatically. Then assign it to your app and database (Our case: 706 Opportunity Records).
If you like you can also explore the record rows by pressing ‘Explore’ button at the very right corner.
Note: In case if you want to trigger the synchronization based upon any criteria or data changes in your mapped object. Then you can also create custom data ‘Accelerate polling ‘which is based on Salesforce streaming API to trigger the synchronization from Salesforce to Heroku.
7) Heroku – Einstein Analytics Replication Connection: In this step, we will create a Heroku connection from Einstein Analytics. In order to fetch our Opportunities and then store into the replication cache. Please select the ‘Data’ tab from the right corner of your profile. Then select ‘Database’ from the selection (Snapshot Below).
Then select your Postgres SQL database from the selection. Then from the ‘Settings’ tab and you need to view your database credentials by pressing ‘View Credentials’ button at the very right corner. Copy your all credentials on the notepad. As you will need it later to establish the connection between Einstein Analytics & Heroku.
- Host: Your host credentials
- Database: Your database credentials
- User: Your user credentials
- Port: Your port credentials
- URI: Your port credentials (JDBC Connection URL)
- Password: Your password credentials
Now go to your Salesforce Environment and then Analytics Studio. Then you need to press ‘Data Manager’ in the very right corner (Snapshot below).
Press ‘Setup’ from the left panel and then press ‘Set-Up Replication’ at the very right corner (Snapshot Below).
Select the connection type ‘Heroku Postgres’ for your new remote connection (Snapshot Below).
Then insert your earlier copied database credentials and give some unique name to this connection and developer name with some description and press save.Make sure to put the right syntax in the credentials. Esp. your schema name and JDBC connection URL (Snapshot Below)
“Note: Make sure to remove the blank spaces and do the right formatting structure for your JDBC Connection URL” (Sample Below) :
Syntax : jdbc:postgresql://host:port/database
Now the connection has been established. In the next step go to your ‘Connection’ select some specific fields from your replication in order to make those available for your data flow to use and Press Continue and then SAVE (Snapshot Below) .
Now you can see your new ‘Heroku Postgres’ connection is there in Einstein Analytics data manager tab for your data flow to use for future and you can set your scheduling for this connection from there (Snapshot Below).
In the last step run the Heroku Postgres replication and monitor the job run
8) EA Replication – EA Dataflow: In this step, we will create our first data flow to use our Heroku opportunities from the replication cache. In order to register those as our ‘First Heroku Dataset’ in Einstein Analytics.
Create a new data flow and in my case, I am giving my data flow name as ‘Heroku Einstein Dataflow’(Snapshot Below).
Then select the digest node with an umbrella icon (Snapshot Below).
Then give a node name and select your connection, source object and fields you want to use for your data flow (Snapshot Below).
In the next step, select the register node in order to make this dataset further available for your dashboards(Snapshot Below).
In the last step run your data flow and check your dataset in the shared app folder.
- In the above scenario, I have focused on unidirectional flow to ingest the data into EA from Heroku. But this could be easily transformed into bi-directional using n number of target SFDC instances for your use case using EA API’s.
- Think ahead about designing the data schemas for your data models. Eventually, the data will be incrementally increased over the time. You don’t want to put yourself in a position where you will be having bottlenecks in your architecture and have to re-engineer your schemas from scratch.
- Performance testing should be equally considered individually for your use cases in running sprints. It’s not only about having the right KPI number on the UI. But underlined processes are equally important how you have derived that number on a long run for scalable reporting architecture. You don’t want to close one ticket to open another 10 tickets ;))