Apache Spark is one of the most important Big Data frameworks. It is a multi-language engine for executing large-scale data processing on a single-node machine or across a distributed cluster. Google Cloud Dataproc provides the managed cluster we need to run Spark jobs in distributed mode. There are various ways to run and schedule Spark jobs on GCP; here we will see how to create a Dataproc cluster and run a PySpark job on it.
Let's create a PySpark job that reads data from the orders-payments BigQuery table, finds the total amount paid for each payment type, and writes the output to the payments-summary BigQuery table.
To run a Spark job on Google Cloud, we first need to create a cluster. Go to the Google Cloud console, open Dataproc, and follow the steps below.
Click on Create cluster and set up the cluster configuration as per your needs. You can create a standard cluster with 1 master and N workers.
Your cluster will be created once you select Create, and you can click on the cluster name to see its details.
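If you prefer to create the cluster programmatically instead of through the console, a minimal sketch using the Dataproc Python client library could look like the following. The project ID, region, cluster name, and machine types are placeholders you would replace with your own values.

from google.cloud import dataproc_v1

# Placeholders - replace with your own project, region, and cluster name.
project_id = "your-project"
region = "us-central1"
cluster_name = "pyspark-demo-cluster"

# The client must point at the regional Dataproc endpoint.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A standard cluster: 1 master and 2 workers.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()  # blocks until the cluster is ready
print(f"Cluster created: {result.cluster_name}")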
The input BigQuery table, orders-payments, is shown below.
Let's create the output table, payments-summary. Go to the Cloud console, open BigQuery, and create the table.
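If you would rather create the table from code than through the console, a minimal sketch with the google-cloud-bigquery client might look like this. The dataset name and the column names/types are assumptions; adjust them to match your own schema.

from google.cloud import bigquery

client = bigquery.Client()

# Assumed schema: one row per payment type with its total amount paid.
schema = [
    bigquery.SchemaField("payment_type", "STRING"),
    bigquery.SchemaField("total_amount", "NUMERIC"),
]

# "your-project" and "your_dataset" are placeholders.
table = bigquery.Table("your-project.your_dataset.payments-summary", schema=schema)
client.create_table(table)  # creates the empty payments-summary table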
Let's write the PySpark program that reads the data from the orders-payments BigQuery table, implements the aggregation logic, and loads the result into the payments-summary BigQuery table.
You can view the code here - https://github.com/dataengineertech/gcpdataproc/blob/main/examples/dataproc/pyspark/pyspark-dataproc.py
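The linked file contains the complete program; as a rough sketch of what such a job looks like, the version below uses the spark-bigquery connector. The project ID, dataset name, column names (payment_type, payment_value), and the temporary GCS bucket are assumptions - adapt them to your own table.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("payments-summary").getOrCreate()

# Read the input table through the spark-bigquery connector.
orders_payments = (
    spark.read.format("bigquery")
    .option("table", "your-project.your_dataset.orders-payments")
    .load()
)

# Total amount paid per payment type (column names are assumed).
payments_summary = (
    orders_payments.groupBy("payment_type")
    .agg(F.sum("payment_value").alias("total_amount"))
)

# Write the result back to BigQuery; the connector stages the data
# in a temporary GCS bucket before loading it into the table.
(
    payments_summary.write.format("bigquery")
    .option("table", "your-project.your_dataset.payments-summary")
    .option("temporaryGcsBucket", "your-temp-bucket")
    .save()
)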
Now that the code is ready, we will move this file to a GCS bucket. You can upload the file to GCS using the Google Cloud console or the gcloud command line.
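As another option, the upload can also be scripted with the Cloud Storage Python client; the bucket name and object path below are placeholders.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-code-bucket")  # placeholder bucket name

# Upload the local PySpark file to the bucket.
blob = bucket.blob("pyspark/pyspark-dataproc.py")
blob.upload_from_filename("pyspark-dataproc.py")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")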
We will now submit our job on the cluster we created. Go to Dataproc -> Clusters -> select the cluster -> Submit job to see the job submission options. Select the job type as PySpark, and browse to the Python file in the GCS bucket in the Main python file section. In the Jar files field, provide the BigQuery connector jar; it is needed to establish the connection from the cluster to the BigQuery dataset.
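The same submission can be done programmatically with the Dataproc jobs client, as in the sketch below. The bucket path is a placeholder, and the jar shown is the publicly hosted spark-bigquery connector - check the connector documentation for the version that matches your cluster's Spark/Scala version.

from google.cloud import dataproc_v1

project_id = "your-project"            # placeholder
region = "us-central1"                 # placeholder
cluster_name = "pyspark-demo-cluster"  # placeholder

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {
        "main_python_file_uri": "gs://your-code-bucket/pyspark/pyspark-dataproc.py",
        # BigQuery connector jar so the job can talk to BigQuery.
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()  # waits for the job to finish
print("Job finished with state:", response.status.state)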
Now, click Submit to see the logs and the output of the PySpark job.
We can verify the output table details in BigQuery.
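For example, a quick check with the BigQuery Python client (project and dataset names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Backticks are required because the table name contains a dash.
query = """
    SELECT payment_type, total_amount
    FROM `your-project.your_dataset.payments-summary`
    ORDER BY total_amount DESC
"""

for row in client.query(query).result():
    print(row.payment_type, row.total_amount)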
The PySpark code loads the aggregated data into the payments_summary BigQuery table. If you want to overwrite the existing data in the table, you can use .mode("overwrite") together with .format('bigquery') while saving the DataFrame. You can refer to the code here - https://github.com/dataengineertech/gcpdataproc/blob/main/examples/dataproc/pyspark/pyspark-overwrite_table.py
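The linked file shows the full version; the relevant change is only in the write step, roughly as follows (the table path and bucket are the same placeholders as in the earlier sketch):

# Overwrite the existing rows in the target table instead of appending.
(
    payments_summary.write.format("bigquery")
    .mode("overwrite")
    .option("table", "your-project.your_dataset.payments-summary")
    .option("temporaryGcsBucket", "your-temp-bucket")
    .save()
)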
Once you see the data in BigQuery, you can delete or stop your cluster from the Google Cloud console to save resources.
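If you want to script the cleanup as well, here is a minimal sketch with the Dataproc client; project, region, and cluster name are the same placeholders as before.

from google.cloud import dataproc_v1

region = "us-central1"  # placeholder
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Deleting the cluster stops billing for its VMs.
operation = cluster_client.delete_cluster(
    request={
        "project_id": "your-project",
        "region": region,
        "cluster_name": "pyspark-demo-cluster",
    }
)
operation.result()  # waits until the cluster is fully deleted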