Cloud Dataflow quickstart in Java

Martin Gorner

In this article JAX London speaker Martin Gorner shows you how to set up your Google Cloud Platform project to use Cloud Dataflow, create a Maven project with the Cloud Dataflow SDK and examples, and run an example pipeline using the Google Cloud Platform Console.

Before you begin

  1. Select or create a Cloud Platform Console project.
    Go to the Projects page
  2. Enable billing for your project.
    Enable billing
  3. Enable the Cloud Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Cloud Pub/Sub, and Cloud Datastore APIs.
    Enable the APIs
  4. Install the Cloud SDK.
  5. Authenticate gcloud with Google Cloud Platform:
    gcloud init
    
  6. Create a Cloud Storage bucket:
    1. In the Cloud Platform Console, go to the Cloud Storage browser.
      Go to the Cloud Storage browser
    2. Click Create bucket.
    3. In the Create bucket dialog, specify the following attributes:
      • Name: A unique bucket name. Do not include sensitive information in the bucket name, as the bucket namespace is global and publicly visible.
      • Storage class: Standard
      • Location: United States
    4. Click Create.
  7. Download and install the Java Development Kit (JDK) version 1.7 or later. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
  8. Download and install Apache Maven by following Maven’s installation guide for your specific operating system.
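
Before moving on, it can help to confirm that the tools from steps 4, 7, and 8 are actually on your PATH. The following is a minimal sketch and not part of the original quickstart; the helper name check_tool is made up for illustration.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Tools this quickstart relies on: the JDK, Maven, and the Cloud SDK.
for tool in java mvn gcloud; do
  check_tool "$tool" || true
done

# JAVA_HOME should point at the JDK installation directory.
if [ -n "${JAVA_HOME:-}" ] && [ -d "$JAVA_HOME" ]; then
  echo "JAVA_HOME=$JAVA_HOME"
else
  echo "JAVA_HOME is not set or is not a directory"
fi
```

Anything reported as missing should be installed before you generate the Maven project.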

Create a Maven project that contains the Cloud Dataflow SDK for Java and examples

    1. Create a Maven project containing the Cloud Dataflow SDK for Java using the Maven Archetype Plugin. Run the mvn archetype:generate command in your shell or terminal as follows:
      mvn archetype:generate \
            -DarchetypeArtifactId=google-cloud-dataflow-java-archetypes-examples \
            -DarchetypeGroupId=com.google.cloud.dataflow \
            -DgroupId=com.example \
            -DartifactId=first-dataflow \
            -Dversion="[1.0.0,2.0.0]" \
            -DinteractiveMode=false \
            -Dpackage=com.google.cloud.dataflow.examples
      

After running the command, you should see a new directory called first-dataflow under your current directory. first-dataflow contains a Maven project that includes the Cloud Dataflow SDK for Java and example pipelines.
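
A quick way to confirm that the archetype ran as expected is to look for the generated pom.xml. This is only a sketch: has_pom is a hypothetical helper, and the location of WordCount.java assumes Maven's standard src/main/java layout, which the text above does not state.

```shell
#!/bin/sh
# Directory created by the archetype (the -DartifactId value used above).
PROJECT_DIR="first-dataflow"

# has_pom is a hypothetical helper: succeeds if the directory holds a pom.xml.
has_pom() {
  [ -f "$1/pom.xml" ]
}

if has_pom "$PROJECT_DIR"; then
  echo "Maven project found in $PROJECT_DIR"
  # Assumption: sources follow Maven's standard src/main/java layout.
  find "$PROJECT_DIR/src/main/java" -name 'WordCount.java'
else
  echo "No pom.xml in $PROJECT_DIR; re-run mvn archetype:generate"
fi
```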

Run an example pipeline on the Cloud Dataflow service

  1. Change to the first-dataflow/ directory.
  2. Build and run the WordCount example pipeline on the Cloud Dataflow managed service by using the mvn compile exec:java command in your shell or terminal window. For the --project argument, specify the Project ID of the Cloud Platform project that you created. For the --stagingLocation and --output arguments, specify the name of the Cloud Storage bucket that you created as part of the path.
     For example, if your Cloud Platform Project ID is my-cloud-project and your Cloud Storage bucket name is my-wordcount-storage-bucket, enter the following command to run the WordCount pipeline:
      mvn compile exec:java \
          -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
          -Dexec.args="--project=my-cloud-project \
          --stagingLocation=gs://my-wordcount-storage-bucket/staging/ \
          --output=gs://my-wordcount-storage-bucket/output \
          --runner=BlockingDataflowPipelineRunner"
    
  3. Check that your job succeeded:
    1. Open the Cloud Dataflow Monitoring UI in the Google Cloud Platform Console.
      Go to the Cloud Dataflow Monitoring UI
      You should see your wordcount job with a status of Running at first, and then Succeeded.
    2. Open the Cloud Storage browser in the Google Cloud Platform Console.
      Go to the Cloud Storage browser
      In your bucket, you should see the output files and staging files that your job created.
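
If you expect to run the example more than once, the project ID and bucket name from step 2 can be kept in variables instead of being typed into the command each time. The sketch below is an illustration rather than part of the original quickstart: wordcount_args is a hypothetical helper, and the two values are the example names used above. It performs a dry run, printing the Maven command and a gsutil listing command for you to review and run yourself.

```shell
#!/bin/sh
# Example values from the text; replace with your own project ID and bucket name.
PROJECT_ID="my-cloud-project"
BUCKET="my-wordcount-storage-bucket"

# wordcount_args is a hypothetical helper that prints the pipeline options
# used in step 2 for a given project ID ($1) and bucket name ($2).
wordcount_args() {
  printf '%s %s %s %s\n' \
    "--project=$1" \
    "--stagingLocation=gs://$2/staging/" \
    "--output=gs://$2/output" \
    "--runner=BlockingDataflowPipelineRunner"
}

# Dry run: print the commands for review instead of executing them.
echo "mvn compile exec:java" \
  "-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount" \
  "-Dexec.args=\"$(wordcount_args "$PROJECT_ID" "$BUCKET")\""
# Listing the bucket is a command-line alternative to the Cloud Storage browser.
echo "gsutil ls gs://$BUCKET/"
```

Printing rather than executing keeps the sketch safe to try: nothing is submitted to the Cloud Dataflow service until you run the printed command from the first-dataflow/ directory.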

Clean up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this quickstart:

    1. Open the Cloud Storage browser in the Google Cloud Platform Console.
    2. Select the checkbox next to the bucket that you created.
    3. Click DELETE.
    4. Click Delete to permanently delete the bucket and its contents.
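
The same cleanup can be done from the command line with gsutil, which is installed with the Cloud SDK. The sketch below is not part of the original quickstart; cleanup_cmd is a hypothetical helper and the bucket name is the example value used earlier. It only prints the delete command so that you can review it first, because gsutil rm -r on a bucket URL removes the bucket and everything in it.

```shell
#!/bin/sh
# Example bucket name from the text; replace with the bucket you created.
BUCKET="my-wordcount-storage-bucket"

# cleanup_cmd is a hypothetical helper: it prints the gsutil command that
# deletes the given bucket and its contents, and refuses an empty name.
cleanup_cmd() {
  if [ -z "$1" ]; then
    echo "bucket name required" >&2
    return 1
  fi
  printf 'gsutil rm -r gs://%s\n' "$1"
}

# Dry run: review the printed command, then run it yourself.
cleanup_cmd "$BUCKET"
```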
    Author

    Martin Gorner

    Martin is passionate about science, technology, coding, algorithms and everything in between. He graduated from Mines ParisTech with a major in computer vision, enjoyed his first engineering years in the computer architecture group of STMicroelectronics, and then spent the next 11 years shaping the nascent eBook market, starting with the Mobipocket startup, which later became the software part of the Amazon Kindle and its mobile variants. He joined Google Developer Relations in 2011 and now enjoys playing with parallel processing and machine learning (Dataflow and TensorFlow).

