How do I get my data into BigQuery?
Let’s compare the options for loading data in BigQuery.
The factors we’ll compare on are complexity and risk.
If you look at Google’s Business Intelligence guide:
We’ll compare the first four options above and then we’ll take a look at the Partner options.
Cloud Dataflow and Cloud Dataprep
Cloud Dataflow and Cloud Dataprep can be discussed together: Cloud Dataprep is a product that generates templates which run on the Dataflow service.
Both are slow and expensive compared with rockstarETL’s loading option and BigQuery’s awesome query processing power.
Most Dataflow transforms I have created run for about 30 minutes:
This was when I used an ETL paradigm. After rewriting the transforms into SQL and running them in BigQuery (consistent with an ELT paradigm), I got them down to less than 2 minutes! I use rockstarETL to schedule all of my loads and SQL queries.
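To make the ETL-vs-ELT distinction concrete, here is a toy sketch of the ELT pattern. BigQuery itself needs credentials and a project, so Python’s built-in sqlite3 stands in for the warehouse; the table name, rows, and query are all made up for illustration:

```python
import sqlite3

# Toy stand-in for a warehouse: in ELT, raw data is loaded as-is,
# and all transformation happens afterwards in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL)")

# "Extract + Load": dump the raw rows straight in, no transformation yet.
raw_rows = [(1, "alice", 120.0), (2, "bob", 80.0), (3, "alice", 50.0)]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# "Transform": a plain SQL query replaces what would otherwise be custom
# pipeline code; in BigQuery this could be a view or a scheduled query.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM raw_orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)  # [('alice', 170.0), ('bob', 80.0)]
```

The point is where the logic lives: once the raw data is loaded, every transform is a SQL statement you can inspect, version, and rerun, rather than pipeline code you have to redeploy.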
With Cloud Dataflow you can create batch and streaming pipelines. Unless you use the templates provided, you are in for one of the steepest learning curves ever. Please do not take this as a challenge! The costs in time, effort, and processing dollars simply outweigh any benefit you will receive!
I’ll give you an example:
This is the streamlined helper library to help you perform joins:
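Screenshot aside, the general shape of a Dataflow-style join can be sketched in plain Python. This is not Beam’s actual API, just a rough illustration of the CoGroupByKey pattern: key both collections, group per key, then flatten the groups yourself (the names and data are invented):

```python
from collections import defaultdict

# Two keyed collections, as a Dataflow join would expect them.
emails = [("alice", "alice@example.com"), ("bob", "bob@example.com")]
orders = [("alice", 120.0), ("alice", 50.0), ("carol", 9.99)]

# Step 1: group both collections by key.
grouped = defaultdict(lambda: ([], []))
for key, email in emails:
    grouped[key][0].append(email)
for key, amount in orders:
    grouped[key][1].append(amount)

# Step 2: emit joined rows yourself (inner-join semantics here).
# In SQL this whole exercise is a single "JOIN ... ON" clause.
joined = [
    (key, email, amount)
    for key, (mails, amounts) in sorted(grouped.items())
    for email in mails
    for amount in amounts
]
print(joined)  # [('alice', 'alice@example.com', 120.0), ('alice', 'alice@example.com', 50.0)]
```

Every join variant (left, right, outer) means more hand-written emit logic, which is exactly the kind of code a SQL engine writes for you.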
Honestly, you are better off using rockstarETL to load your data straight into BigQuery. From there you can write all of your transforms in familiar SQL. It is recommended that you build all of your loads and transforms in the same pipeline inside rockstarETL.
If you absolutely love Dataflow or Dataprep, rockstarETL supports these job types as well! You can simply add them as steps in your pipeline.
The streaming Dataflow option is fairly attractive, but I struggle to justify the extra cost, as there are no BigQuery charges for loading data (this is the option rockstarETL uses) and loading is fast.
Streaming also takes a lot more setup. Having said that, you could set up a streaming job and then use rockstarETL’s Conditional Job Type as a start for further processing in a new pipeline!
Because it is maintained by a third party (Trifacta), Dataprep locks you into their platform. It takes a significant investment of time and energy to learn the product and write your ETL transforms. While Dataprep is an amazing product and quite user-friendly, I found creating joins and unions painful while the tool loaded sample data. If you’ve ever had to wait for a sample to complete, you know exactly what I’m referring to:
I would much rather just use SQL for joins and transforms – so much faster to inspect and maintain.
Another pain with Dataflow is having to update your SDK version:
There is a fairly aggressive deprecation policy, so you’ll find yourself updating regularly. This is especially painful in Dataprep: you have to go into every flow you’ve created, and into every export you are using, and re-create the export to generate a new template file:
Another important point to note: while Dataprep does allow you to schedule a flow:
That pricing will probably make most exploratory analysis unfeasible…
This is not even mentioning the time and effort required to learn and set it up!
Matillion
Matillion has three pricing tiers: “Extra Large”, “Large”, and “Medium”.
Take note of the “Solution trial credit”
This is a once-off arrangement, so you have to add the credit back to see what your actual monthly price will be.
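With hypothetical numbers (the real figures are on the vendor’s pricing page), adding the one-off credit back looks like this:

```python
# Hypothetical figures for illustration only -- check the vendor's
# pricing page for the real numbers.
advertised_first_month = 1000.00  # price shown after the "Solution trial credit"
solution_trial_credit = 500.00    # applied once, to the first month only

# The credit is once-off, so the ongoing monthly cost is the advertised
# first-month price with the credit added back.
actual_monthly_price = advertised_first_month + solution_trial_credit
print(actual_monthly_price)  # 1500.0
```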
"Extra Large" Pricing:
With Matillion, like many other partner integrations, the problem is that you are writing all of your business logic into their product.
If they ever shut down, you lose the time and effort you invested learning their solution.
If they change their product, your logic is at risk whenever the feature you built it with is deprecated.
With rockstarETL, your logic is stored in:
1. Views that you create and control, stored in your dataset in your BigQuery project!
2. Queries that you write, stored as text files inside your Cloud Storage bucket!
In rockstarETL, you only store which steps you want to execute, in the order and frequency you specify.
All of the logic is stored inside of your own Google Cloud Project! You don’t build anything into someone else’s product!