We can create Dataflow jobs from the Cloud Console using Google-provided templates. For example, if we need a real-time streaming job that reads data from Pub/Sub and writes it into a BigQuery table, we can select the Pub/Sub to BigQuery template while creating the Dataflow job and supply the required parameters. However, if the business use case is different, for example reading from multiple Pub/Sub topics/subscriptions and writing the data to multiple tables, or applying a lot of transformation logic, we can create custom Dataflow templates using Apache Beam. An Apache Beam pipeline can be written with the Java SDK or the Python SDK. In this article we will see how the Beam pipeline can be implemented using the Java SDK.
Let's say we have a requirement to create a custom Dataflow job that reads data from a Pub/Sub subscription and appends the data to a BigQuery table.
input subscription:
output table:
Let's create the Beam pipeline using Java. You can download the code from the repo - https://github.com/dataengineertech/gcpdataproc
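For orientation, here is a minimal sketch of what such a pipeline looks like; it is not a copy of the repo code. The subscription path, the table spec, and the message-to-TableRow mapping are placeholders that you would replace with the real values and the parsing logic for your own message schema.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class PubsubtoBigquery {
    public static void main(String[] args) {
        // Parse the standard Beam/Dataflow options passed via -Dexec.args.
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);

        pipeline
            // Read raw string messages from the input subscription (placeholder path).
            .apply("ReadFromPubSub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/orders-sub"))
            // Map each message payload into a BigQuery TableRow (placeholder mapping).
            .apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
                .via(message -> new TableRow().set("payload", message)))
            // Append rows to an existing BigQuery table (placeholder table spec).
            .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("my-project:orders_dataset.orders")
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

        pipeline.run();
    }
}

The class name here matches the -Dexec.mainClass=PubsubtoBigquery value used in the Maven command further down.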
Once the code compiles successfully, we can deploy it to the cloud to create our Dataflow job. You can use a Compute Engine instance to build and deploy the code with Maven; here, we will do it using Cloud Shell.
Log in to Cloud Shell and clone the repository.
git clone https://github.com/dataengineertech/gcpdataproc.git
Once you clone the repo, it will create a folder named gcpdataproc. Go inside the folder as shown below:
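The folder name simply follows the repo name from the clone URL, so the step is:

cd gcpdataproc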
Go inside the beam-pubsub-bq-maven folder and run the command below to create the Dataflow job.
mvn -Pdataflow-runner compile exec:java -Dexec.mainClass=PubsubtoBigquery -Dexec.args=" --project=adroit-nectar-363812 --jobName=beam-orders-1234 --stagingLocation=gs://orders-customers-bucket1/staging --gcpTempLocation=gs://orders-customers-bucket1/dataflow/temp --runner=DataflowRunner --region=us-west1"
Once the build succeeds, it will create the Dataflow job, and you can see it in the Dataflow Jobs section of the console.
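If you prefer the command line, the job can also be listed from Cloud Shell (the region is the one passed to the Maven command above):

gcloud dataflow jobs list --region=us-west1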
Now that the job is created, let's test it. We will send data to the topic manually, and the job should append the data to the BigQuery table.
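As an illustration, assuming the input subscription is attached to a topic named orders-topic (a placeholder name, as are the dataset and table below), a test message can be published from Cloud Shell and the table checked afterwards:

gcloud pubsub topics publish orders-topic --message='{"order_id":"1001","amount":25.50}'
bq query --use_legacy_sql=false 'SELECT * FROM orders_dataset.orders LIMIT 10'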
We have created this job with only basic properties and parameters. In an enterprise setup, we also have to pass parameters such as network and subnetwork if the project has a VPC enabled. We can also create a template using Maven and launch the job from the console or using Terraform.
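For illustration only (the VPC and subnet names are placeholders), the networking options accepted by the Dataflow runner can be appended to the same Maven command:

mvn -Pdataflow-runner compile exec:java -Dexec.mainClass=PubsubtoBigquery -Dexec.args=" --project=adroit-nectar-363812 --jobName=beam-orders-1234 --stagingLocation=gs://orders-customers-bucket1/staging --gcpTempLocation=gs://orders-customers-bucket1/dataflow/temp --runner=DataflowRunner --region=us-west1 --network=my-vpc --subnetwork=regions/us-west1/subnetworks/my-subnet --usePublicIps=false"

Staging a reusable classic template works the same way: instead of launching the job directly, add --templateLocation=gs://<bucket>/templates/<name> to the arguments, and the staged template can then be launched from the console or from Terraform.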