You need to have Java => 1.6 preinstalled. Download a release of Splout from Maven Central:
After that, you just need to decompress it.
|SPLOUT_HADOOP_COMMON_HOME||Pointing to the folder where the hadoop-common-*.jar can be found|
|SPLOUT_HADOOP_HDFS_HOME||Pointing to the folder where where the hadoop-mapreduce-client-*.jar can be found|
|SPLOUT_HADOOP_MAPRED_HOME||Pointing to the folder where the hadoop-common-*.jar can be found|
|SPLOUT_HADOOP_CONF_DIR||Optional: Optionally, specify the Hadoop configuration folder (e.g. /etc/hadoop/conf). Will default to SPLOUT_HADOOP_MAPRED_HOME/conf|
Example of env variables for Cloudera CDH5 using parcels
export SPLOUT_HADOOP_COMMON_HOME=/opt/cloudera/parcels/CDH/lib/hadoop export SPLOUT_HADOOP_HDFS_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-hdfs export SPLOUT_HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce export SPLOUT_HADOOP_CONF_DIR=/etc/hadoop/conf
HADOOP_HOME must be properly defined pointing to your Hadoop installation folder.
The best thing for getting started with Splout is launching a local server in your machine:
bin/splout-service.sh qnode start bin/splout-service.sh dnode start
Now, if everything went fine, and assuming you had nothing running on port 4412 before, you will see a nice panel in localhost:4412 like this:
Notice how there is no tablespace loaded in the system. That is expected, we will load an example one later. In logs/ folder you will find the logs of both the QNode and the DNode service. You can shutdown the services anytime with:
bin/splout-service.sh qnode stop bin/splout-service.sh dnode stop
Splout comes with a few examples, one of which consists of miscellaneous world facts. We will load it and play with it to get used to the system and its behavior.
First, upload to HDFS the examples folder, as it contains the input files for all examples.
hadoop fs -put examples examples
Load the table city from the CSV file city.csv and use the country_code field to partition it into 4 partitions:
hadoop jar splout-*-hadoop*.jar simple-generate --input examples/world/city.csv --output \ database-files --tablespace city_pby_country_code --table city --separator , --escape \\ --quotes \ \"\"\" --nullstring \\N --schema \ "id:int,name:string,country_code:string,district:string,population:int" --partitionby country_code \ --partitions 4
Finally, deploy the results into Splout, using replication 2. That is, there will be 2 replicas per each partition:
hadoop jar splout-*-hadoop*.jar deploy --root database-files --tablespaces city_pby_country_code \ --replication 2 --qnode http://localhost:4412
In this particular example, as you have only one dnode, replication will be adjusted automatically to 1. Look at the panel to see the information of the created tablespace:
Particularly, look at how partitions has been distributed using different country_code ranges. Now, you can launch some queries at the query console:
Perform the following query to show all the tables in the tablespace:
|city_pby_country_code||(empty)||SELECT * FROM sqlite_master WHERE type='table';|
Now, let's find the cities with country_code
|city_pby_country_code||JPN||SELECT * FROM city WHERE country_code = "JPN"|
Note that the shard 2 (partition 2) was hit. That is because we provided "JPN" as partition key. Otherwise, a wrong partition would have been hit, and no results would have been presented. Perform a test. Repeat the query, but keep the partition key empty. You'll see no results and partition 0 being hit.
The former query can be performed just using the REST Splout SQL interface. For the former query the URL would be the following:
Multiple tables per tablespace are allowed. But all of them must be partitioned by the same key. We are going to create two different tablespaces:
|country||partitioned by country|
|city||partitioned by country_code|
|country_language||full table present in all partitions|
|country||partitioned by continent and region|
|country_language||full table present in all partitions|
The structure of these tablespaces, and the the specification of the input data are in the files: world-pby-country.json and world-pby-continent-region.json
Launch the following command to create both tablespaces:
hadoop jar splout-*-hadoop*.jar generate --output database-files --tablespacefile \ examples/world/world-pby-country.json --tablespacefile examples/world/world-pby-continent-region.json
Have a look to the database-files folder:
hadoop fs -ls database-files
At this point, the tablespace files have been created, but they are not still present at Splout SQL. We have to deploy them. We can deploy both tablespaces at the same time atomically, so we can be sure that information between tablespaces will be consistent. Whichever of the following command performs the deploy of the tablespaces.
hadoop jar splout-*-hadoop*.jar deploy -root database-files -ts world-pby-continent-region -ts world-pby-country -r 2 -q http://localhost:4412
hadoop jar splout-*-hadoop*.jar deploy --config-file examples/world/deployment.json --qnode http://localhost:4412
Now both tablespaces should be present at Splout SQL. Have a look to the console.
You can also check out the following queries:
|Tablespace||Partition key||SQL Query|
|All Japan Languages:||world-pby-country||JPN||SELECT country_language.* FROM country, country_language WHERE country.code = country_language.country_code AND country.code = "JPN"|
|Distinct districs in Japan:||world-pby-country||JPN||SELECT country.name, count(distinct district) as num_districts FROM country, city WHERE country.code = "JPN" and country.code = city.country_code|
|Population of Western Europe:||world-pby-continent-region||EuropeWestern Europe||SELECT continent,region,sum(population) total_population FROM country WHERE continent = "Europe" AND region = "Western Europe";|
|Biggest countries on Central Africa:||world-pby-continent-region||AfricaCentral Africa||SELECT name, surface_area FROM country WHERE continent = "Africa" AND region = "Central Africa" ORDER BY surface_area DESC;|
|Most talked languages on Central Africa:||world-pby-continent-region||AfricaCentral Africa||SELECT language, sum((percentage/100)*population) as people FROM country, country_language WHERE country.code = country_language.country_code AND continent = "Africa" AND region = "Central Africa" GROUP BY language ORDER BY people DESC;|
Note that all the queries presented before are compatible with the partitioning decided for the tablespace. Otherwise they would have not worked properly. Also, be careful to properly define the needed indexes in order to answer queries fast.