Sqoop daily Oracle data into Hive table partition

The following bash script imports Oracle records into a Hive table, partitioned by date, using Sqoop. Both Hive and Sqoop ship with typical Hadoop distributions, such as the Hortonworks Sandbox.

#!/bin/bash

# Uppercase a string (Oracle dictionary views store identifiers in uppercase).
upper() {
  echo "$1" | tr '[:lower:]' '[:upper:]'
}

if [ $# -ge 4 ]; then
  schema=$(upper "$1")
  table=$(upper "$2")
  column_to_split_by=$(upper "$3")
  date_column=$(upper "$4")
  date_value="${5:-$(date +%Y-%m-%d)}"   # default to today when omitted
else
  echo
  echo "Usage: $(basename "$0") schema table column-to-split-by date-column [YYYY-MM-DD]"
  echo
  echo "Imports all records where the value of date-column equals the given date from"
  echo "Oracle table schema.table as a Hive table partition."
  echo "Hadoop will split the import job based on the column-to-split-by."
  echo "* The table must have the columns specified as column-to-split-by and date-column."
  echo "* The column-to-split-by must be of finer granularity than date-column, ideally unique."
  echo "* The date must be in YYYY-MM-DD format."
  echo "* If the date is unspecified, the current date is used."
  exit 1
fi

echo "schema = $schema"
echo "table = $table"
echo "column_to_split_by = $column_to_split_by"
echo "date_column = $date_column"
echo "date_value = $date_value"

# we have to drop the partition, because --hive-overwrite does not seem to do it
hive -e "use $schema; alter table $table drop if exists partition($date_column='$date_value');"

# Build the comma-separated column list, excluding the partition (date) column,
# which Hive supplies via --hive-partition-key instead.
columns=$( \
sqoop eval \
--options-file /usr/local/etc/sqoop-options.txt \
--query "select column_name from all_tab_columns where owner = '$schema' and table_name = '$table'" \
| tr -d ' |' \
| grep -Ev "^-+$|COLUMN_NAME|^$date_column$" \
| tr '\n' ',' \
| sed -e 's/,$//'
)

query="select $columns from $schema.$table \
       where $date_column = to_date('$date_value', 'YYYY-MM-DD') \
       and \$CONDITIONS"

echo "query = $query"

sqoop import \
--options-file "/usr/local/etc/sqoop-options.txt" \
--query "$query" \
--split-by "$column_to_split_by" \
--target-dir "$schema.$table" \
--hive-import \
--hive-overwrite \
--hive-table "$schema.$table" \
--hive-partition-key "$date_column" \
--hive-partition-value "$date_value" \
--outdir "$HOME/java"
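
For example, assuming the script is saved as sqoop-import-partition.sh and given a hypothetical SALES.ORDERS table with a unique ORDER_ID column and an ORDER_DATE column (all names illustrative), a daily import for 2013-11-05 could be invoked like this:

./sqoop-import-partition.sh SALES ORDERS ORDER_ID ORDER_DATE 2013-11-05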

JDBC connection details

Put them into /usr/local/etc/sqoop-options.txt, in a format like this:

--connect
jdbc:oracle:thin:@hostname:port:SID
--username
oracle_username
--password
oracle_password
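
To verify the connection details before running any imports, you can issue a trivial query through sqoop eval, the same mechanism the script uses to read the column list:

sqoop eval \
--options-file /usr/local/etc/sqoop-options.txt \
--query "select 1 from dual"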

Make apps4halifax – Intro

Halifax Regional Municipality is introducing ‘apps4halifax’, its first-ever Open Data App Contest. Similar initiatives have been successful in Ottawa, Edmonton and many other cities worldwide.

Residents can submit ideas or code apps using the HRM Open Data catalog. The best submissions may win cash prizes and awards.

The Open Data catalog is implemented using the Socrata Open Data Portal. The SODA 2.0 RESTful web service API allows developers to query and consume live data from the public datasets.

The Socrata developer documentation explains how to query the endpoints, which datatypes are supported, and the available response formats.

The currently available datasets include Crime occurrences, Building types, Buildings, Bus Routes, Bus Stops, Bylaw Areas, Civic Addresses, Community Boundaries, Park Recreation Features, Parks, Polling Districts, Streets, Trails, Transit Times, Waste collections and Zoning Boundaries.

You can construct web service query URLs like this:

http://www.halifaxopendata.ca/resource/RESOURCE-ID.json?COLUMN=value

You can determine the RESOURCE-ID for a dataset like this:

  1. Go to https://www.halifaxopendata.ca/
  2. Click on a dataset name
  3. Click the “Export” button
  4. Under “Download As”, copy one of the links, e.g. JSON
  5. The resource id is the part of the URL between “views/” and “/rows.json”
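
Once you have a resource id, fetching data is a plain HTTP GET. As a quick sketch using the waste collection resource id from the example below, you could preview a few rows with curl ($limit is a standard SODA paging parameter; single quotes stop the shell from expanding it):

curl 'http://www.halifaxopendata.ca/resource/ga7p-4mik.json?$limit=5'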

As an extremely useful example, you could query fun things like all HRM garbage collections occurring on Wednesdays:

http://www.halifaxopendata.ca/resource/ga7p-4mik.json?collect='WEDNESDAY'

A simple and quick way to create web pages that interact with these web services is to use jQuery and its getJSON() function.
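
As a minimal sketch, this fetches the Wednesday query above and lists the results on a page (the "#collections" element and the "area" field are assumptions; check the dataset's metadata for the real field names):

// Fetch Wednesday waste collections and append one list item per row.
$.getJSON("http://www.halifaxopendata.ca/resource/ga7p-4mik.json?collect='WEDNESDAY'", function (rows) {
  $.each(rows, function (i, row) {
    // "area" is a hypothetical field name from the dataset.
    $("#collections").append("<li>" + row.area + "</li>");
  });
});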

I will probably follow up with more posts on this topic soon.