Groovy is a powerful concise language that has many aspects that lend itself well to ETL (Extract-Transform-Load) jobs. It has similarities to languages like Perl, Ruby and other scripting languages, but has the advantage of being able to easily hook into the Java ecosystem of third party libraries.

This article intends to review some of the language features that lend itself well toETL.

Regular Expressions
Regular expressions are one of the foundations of ETL. No matter what tool or methods are used to perform ETL, regular expressions are used to help parse unstructured or semi-structured text.

Groovy offers a simplified regular expression syntax. Some examples below:

def str = "1-234"
// Matcher.find() on str looking for a String containing 1 digit followed by a dash and 3 digits.
if (str =~ /\d{1}\-\d{3}/) {
    println "Matched"
}

def matcher
str = "This is a test String of 1234 and 5678."
if ((matcher = (str =~ /.+?(\d+).+?(\d+)/))) {
    // Prints 12345678
    println matcher.group(1) + matcher.group(2)
}

More information:
Groovy Regex Reference
Groovy Regex Tutorial 1
Groovy Regex Tutorial 2
Good general Regex Guide

XML processing
Anyone who has done XML processing with Java knows that the code can tend to be extremely verbose and cumbersome. Processing XML with Groovy is a big upgrade over libraries like JDOM.

def xml = 
"""<Rows>
<Row>
<Name>Name1</Name><Description>Description1</Description>
</Row>
<Row>
<Name>Name2</Name><Description>Description2</Description>
</Row>
</Rows>"""

def node = new XmlParser().parseText(xml)
node.Row.each {
    println "${it.Name.text()} - ${it.Description.text()}"
}

More information:
Groovy XML

SQL
Frequently during ETL operations you’ll have to interact with a standard relational database. While there are many tools such as Hibernate and others…to get adequate performance often raw JDBC is necessary. Groovy makes working withSQL databases easy.

The following example is using a standard test table from a H2 database.

import groovy.sql.Sql

def sql = Sql.newInstance('jdbc:h2:tcp://localhost/~/test', 'sa', '', 'org.h2.Driver')

sql.rows("SELECT id, name FROM test").each {
    println "${it.id} - ${it.name}"
}

sql.execute("INSERT INTO TEST (id, name) VALUES(?, ?)", [3, "New Value"]);

// Show new row of 3, New Value
sql.rows("SELECT id, name FROM test").each {
    println "${it.id} - ${it.name}"
}

sql.execute("UPDATE test set name=? where id =?", ['NEW VALUE', 3])

// Show updated row of 3, NEW VALUE
sql.rows("SELECT id, name FROM test").each {
    println "${it.id} - ${it.name}"
}

sql.execute("DELETE FROM test where id = ?", [3])
sql.commit()

More Information:
Groovy SQL
Groovy SQL Tutorial

File I/O
File based I/O is one of the most common operations of a ETL environment. Typically you’ll receive a CSV or other file based format that needs to be parsed and loaded somewhere. Groovy makes reading and writing to files a breeze as follows:

// Read each line and split
new File("file.txt").eachLine {
    def row = it.split(",")
    println "${row[0]} - ${row[1]}"
}

// Read each line and split using groovy method
new File("file.txt").splitEachLine(",") {
    it.each { print "$it " }
    println ""
}

// Write content out to file
new File("file2.txt").write("1, 2, 3, 4\n")

More Information:
Groovy I/O

Groovy JDK
The Groovy JDK is an add on feature to common Java classes. This adds additional features to already existing classes within the JDK.

Groovy also adds language operators to allow more concise null checking syntax. For example the Elvis and Safe Navigator operator automatically defaults and performs null checks to prevent the developer from manually having to perform these checks.

The following examples demonstrate both Groovy JDK and other language features.

def values = [1, 2, 3, 4]

// Print each value in the list
values.each {
    println it
}

// Multiplies each value by 2 and prints it to the console
values.collect { it*2 }.each {
    println it
}

// Print the contents of file.txt
println new File("file.txt").text

// Parse a date string
println Date.parse("yyyy-MM-dd", "2012-01-01")

// Print current time in UTC
println new Date().format("yyyy-MM-dd'T'HH:mm:ss", TimeZone.getTimeZone("UTC"))

//Groovy maps
def map = [:]
map.key = "value"

println map

def value = null

// Print value or if NULL print DEFAULT using Elvis (?:) operator
println value ?: "DEFAULT"

// NullPointer protection using the ?. operator
println value?.text

More Information:
Groovy JDK

This is only a basic list of how Groovy can assist with ETL tasks. These tasks can be run standalone in simple configurations. They could also be integrated with Java based ETL frameworks and tools to provide clustering, parallelization and scalability support.