What I want to emphasize is that CSV is such a convenient intermediate format for processing and saving data. It's concise, and it can be analyzed with plain SQL in a single step:
- load it into MySQL with `LOAD DATA INFILE` (see the sketch after this list), or
- query it with Spark SQL directly.
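
For the MySQL route, a minimal sketch, assuming a table `test` with matching columns already exists and the CSV sits at a hypothetical path `/tmp/test/csvs/data.csv`:

```sql
-- Bulk-load the CSV into an existing MySQL table.
-- The table name and file path here are placeholders, not from this post.
LOAD DATA INFILE '/tmp/test/csvs/data.csv'
INTO TABLE test
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```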
Guides to querying CSV files from Spark SQL usually pull in the CsvContext dependency (from the spark-csv package). Here I introduce another way: an external Hive table over the raw files.
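
For comparison, the spark-csv route registers a table through the data-source API. Roughly, per that package's documentation, with placeholder table and path names:

```sql
-- Register the CSV directory as a temporary table via spark-csv.
-- Requires the com.databricks:spark-csv package on the classpath.
CREATE TEMPORARY TABLE test_csv
USING com.databricks.spark.csv
OPTIONS (path "file:///tmp/test/csvs", header "false");
```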
- Data sample (save it as a file under `/tmp/test/csvs`; the external table below points at that directory):

```
2016,11,abc,jack
2016,10,bb,jim
2015,03,test,paul
2014,09,good,simth
```
- Create a Hive external table over the data (the `location` clause names the directory, so every file under it is treated as table data):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS test (
  year STRING,
  day STRING,
  tag STRING,
  name STRING
)
row format delimited
fields terminated by ','
stored as textfile
location 'file:///tmp/test/csvs'
```
- Query and analysis:

```
$ ./bin/spark-shell
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val table = """
     | CREATE EXTERNAL TABLE IF NOT EXISTS test (
     |   year STRING,
     |   day STRING,
     |   tag STRING,
     |   name STRING
     | )
     | row format delimited
     | fields terminated by ','
     | stored as textfile
     | location 'file:///tmp/test/csvs'"""

scala> sqlContext.sql(table)
res0: org.apache.spark.sql.DataFrame = [result: string]

scala> val res = sqlContext.sql("select * from test")
res: org.apache.spark.sql.DataFrame = [year: string, day: string, tag: string, name: string]

scala> res.show()
+----+---+----+-----+
|year|day| tag| name|
+----+---+----+-----+
|2016| 11| abc| jack|
|2016| 10| bb| jim|
|2015| 03|test| paul|
|2014| 09|good|simth|
+----+---+----+-----+
```
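
Once the table exists, any Spark SQL statement runs directly against the raw CSV files. A small sketch of the kind of analysis this enables, counting rows per year (my illustration, not from the original session):

```sql
-- Aggregate straight over the CSV-backed table,
-- e.g. via sqlContext.sql("...").show() in the same shell.
SELECT year, COUNT(*) AS cnt
FROM test
GROUP BY year
ORDER BY year DESC
```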