Query CSV Data with Apache Drill

The most interesting feature of Apache Drill (https://drill.apache.org) is the functionality of querying the raw data in original sources, e.g. json, parquet, or even csv, directly from the file system through the so-called storage plugin. In the demonstration below, I will show how to extract the data from a csv file.

To do the test drive of Apache Drill without the installation, I downloaded the docker container from https://hub.docker.com/r/harisekhon/apache-drill and then started the container by “docker run -it harisekhon/apache-drill”.

Once in the Drill shell, we can use “show schemas;” to list storage plugins that have been configured in the system. As shown below, while there are several default storage plugins, the one that we will use is “dfs” that points to the local file system.


0: jdbc:drill:zk=local> show schemas;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| sys                 |
+---------------------+

We can explore file formats supported in the dfs local file system by the command “describe schema dfs;”. As shown below, in the pre-configured storage plugin, 2 entries are specifically related to the csv data format that will be used later. The entry “csv” supports data files without headers and the entry “csvh” supports data files with headers.


    "csv" : {
      "type" : "text",
      "extensions" : [ "csv" ],
      "delimiter" : ","
    },
    "csvh" : {
      "type" : "text",
      "extensions" : [ "csvh" ],
      "extractHeader" : true,
      "delimiter" : ","
    }

Next, we can use the command “show files in dfs;” to explore the file system. For instance, running the command “show files in dfs.`apache-drill/sample-data`;” will list all files in the folder “sample-data”, which includes a csv file named “nycflights.csv” that was copied from the host by “docker cp nycflights.csv 1fc4a0087a3e:/apache-drill/sample-data”, where “1fc4a0087a3e” is the container_id.

From the initial output below, it appears that there are headers in this csv file.


0: jdbc:drill:zk=local> select columns[0] as x1, columns[1] as x2 from dfs.`apache-drill/sample-data/nycflights.csv` limit 5;
+-------+--------+
|  x1   |   x2   |
+-------+--------+
| year  | month  |
| 2013  | 1      |
| 2013  | 1      |
| 2013  | 1      |
| 2013  | 1      |
+-------+--------+

As a result, we need to change the file extension from “csv” to “csvh” in the file system and then query the data again. It turns out successful this time.


0: jdbc:drill:zk=local> select `year`, `month` from dfs.`apache-drill/sample-data/nycflights.csvh` limit 5;
+-------+--------+
| year  | month  |
+-------+--------+
| 2013  | 1      |
| 2013  | 1      |
| 2013  | 1      |
| 2013  | 1      |
| 2013  | 1      |
+-------+--------+

From here, we can explore the data in the raw csv file directly. For instancce, we can aggregate the number of flights by airports.


0: jdbc:drill:zk=local> select origin, count(*) as cnts from dfs.`apache-drill/sample-data/nycflights.csvh` group by origin;
+---------+---------+
| origin  |  cnts   |
+---------+---------+
| JFK     | 111279  |
| EWR     | 120835  |
| LGA     | 104662  |
+---------+---------+

Or we can extract all flights from the airport LGA and then export the data to a JSON file for the future analyses.


0: jdbc:drill:zk=local> use dfs.tmp;
+-------+--------------------------------------+
|  ok   |               summary                |
+-------+--------------------------------------+
| true  | Default schema changed to [dfs.tmp]  |
+-------+--------------------------------------+

0: jdbc:drill:zk=local> alter session set `store.format` = 'json';
+-------+------------------------+
|  ok   |        summary         |
+-------+------------------------+
| true  | store.format updated.  |
+-------+------------------------+

0: jdbc:drill:zk=local> create table dfs.tmp.lga as select * from dfs.`apache-drill/sample-data/nycflights.csvh` where origin = 'LGA';
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 104662                     |
+-----------+----------------------------+

0: jdbc:drill:zk=local> show files in dfs.tmp.`lga`;
+-------------+--------------+---------+-----------+--------+--------+
|    name     | isDirectory  | isFile  |  length   | owner  | group  |       
+-------------+--------------+---------+-----------+--------+--------+
| 0_0_0.json  | false        | true    | 34156554  | root   | root   | 
+-------------+--------------+---------+-----------+--------+--------+