Spark SQL Statements


I. SQL

1. in
Subqueries are not supported, e.g. select * from src where key in(select key from test);
Literal value lists are supported, e.g. select * from src where key in(1,2,3,4,5);
With 40,000 values in the list: 25.766 s
With 80,000 values: 78.827 s
2. union all / union
UNION ALL is not supported at the top level, e.g. select key from src UNION ALL select key from test;
It is supported inside a subquery: select * from (select key from src union all select key from test)aa;
UNION is not supported directly;
the equivalent is: select distinct key from (select key from src union all select key from test)aa;

3. intersect: not supported

4. minus: not supported

5. except: not supported

6. inner join / join / left outer join / right outer join / full outer join / left semi join are all supported.
left outer join / right outer join / full outer join must include the keyword outer.
join is the simplest association: it keeps only the rows whose keys match on both sides (the intersection);
left outer join is driven by the left table; columns from the right table are set to null for keys it does not contain;
right outer join is driven by the right table; columns from the left table are set to null for keys it does not contain;
full outer join keeps every row from both tables, filling in null on whichever side has no matching key;
the main use case of left semi join is replacing exists/in.
Hive does not support subqueries in the where clause, so the exists/in construct common in SQL is unavailable.
Unsupported subquery, e.g. select * from src aa where aa.key in(select bb.key from test bb);
It can be replaced in either of these two ways:
select * from src aa left outer join test bb on aa.key=bb.key where bb.key is not null;
select * from src aa left semi join test bb on aa.key=bb.key;
In most cases JOIN ON and LEFT SEMI JOIN are equivalent.
They differ when tables A and B are joined and B contains duplicate keys:
with JOIN ON, a row of A is emitted once per matching row of B, since each match satisfies the ON condition;
with LEFT SEMI JOIN, a row of A is returned as soon as one match is found in B, and B is not scanned further for that row,
so duplicates in B never produce duplicate output rows. For example:
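A minimal sketch of the difference; the tables a and b and their contents are hypothetical:

-- a(key) holds {1}; b(key) holds {1, 1}
select a.key from a join b on a.key = b.key;            -- returns 2 rows
select a.key from a left semi join b on a.key = b.key;  -- returns 1 row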
left outer join supports subqueries, e.g. select aa.* from src aa left outer join (select * from test111)bb on aa.key=bb.a;

9. Rename a table
hive> ALTER TABLE events RENAME TO 3koobecaf; 

10. Add columns to a table
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT); 

11. Add a column with a column comment
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment'); 

12. Drop a table
hive> DROP TABLE pokes; 

13. top n
hive> select * from test order by key limit 10;

14. Create a database
Create Database baseball;

15. alter table tablename change oldColumn newColumn column_type: rename a column and change its type

alter table yangsy CHANGE product_no phone_no string

16. Run the SQL stored in a .sql file

 spark-sql --driver-class-path /home/hadoop/hive/lib/mysql-connector-java-5.1.30-bin.jar -f testsql.sql 
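As a hypothetical illustration, testsql.sql is just ordinary statements separated by semicolons, reusing tables from the examples above:

-- contents of testsql.sql
select count(*) from src;
select key, value from src where key in(1,2,3,4,5);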

insert into table CI_CUSER_201411171351522
select mainResult.PRODUCT_NO,
       dw_coclbl_m02_3848.L1_01_02_01,
       dw_coclbl_d01_3845.L2_01_01_04
from (select PRODUCT_NO from CI_CUSER_20141114203632267) mainResult
left join DW_COCLBL_M02_201407 dw_coclbl_m02_3848
  on mainResult.PRODUCT_NO = dw_coclbl_m02_3848.PRODUCT_NO
left join DW_COCLBL_D01_20140515 dw_coclbl_d01_3845
  on dw_coclbl_m02_3848.PRODUCT_NO = dw_coclbl_d01_3845.PRODUCT_NO;

insert into CI_CUSER_20141117142123638 (PRODUCT_NO, ATTR_COL_0000, ATTR_COL_0001)
select mainResult.PRODUCT_NO,
       dw_coclbl_m02_3848.L1_01_02_01,
       dw_coclbl_m02_3848.L1_01_03_01
from (select PRODUCT_NO from CI_CUSER_20141114203632267) mainResult
left join DW_COCLBL_M02_201407 dw_coclbl_m02_3848
  on mainResult.PRODUCT_NO = dw_coclbl_m02_3848.PRODUCT_NO;

CREATE TABLE ci_cuser_yymmddhhmisstttttt_tmp(product_no string) row format serde 'com.bizo.hive.serde.csv.CSVSerde' ; 
LOAD DATA LOCAL INPATH '/home/ocdc/coc/yuli/test123.csv' OVERWRITE INTO TABLE test_yuli2;

Create a table that supports the CSV format, stored as a textfile:
CREATE TABLE test_yuli7 row format serde 'com.bizo.hive.serde.csv.CSVSerde' as select * from CI_CUSER_20150310162729786;
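Note that com.bizo.hive.serde.csv.CSVSerde lives in a third-party jar, so it must be on the classpath before these statements run; the jar path below is a placeholder:

ADD JAR /path/to/csv-serde.jar;

Alternatively, pass the jar with --jars (or --driver-class-path, as in the spark-sql invocation above) when launching the session.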

Create a comma-delimited table without depending on the CSVSerde jar (built here as a SQL string in Java):
"create table " +listName+ " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','" +
" as select * from " + listName1;

create table aaaa ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE as select * from

ThriftServer: enabling FAIR mode
To enable FAIR scheduling for the SparkSQL Thrift Server:
1. Edit $SPARK_HOME/conf/spark-defaults.conf and add:
spark.scheduler.mode FAIR
spark.scheduler.allocation.file /Users/tianyi/github/community/apache-spark/conf/fair-scheduler.xml
2. Edit $SPARK_HOME/conf/fair-scheduler.xml (or create it) with content in the following format:

<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <!-- weight: the ratio of resources two pools may use when their minShare is the same -->
    <weight>1</weight>
    <!-- minShare: the amount of resources the pool is guaranteed first -->
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>

3. Restart the Thrift Server.
4. Before executing SQL, run:
set spark.sql.thriftserver.scheduler.pool=<the pool name>;

Once those operations are done: create table yangsy555 like CI_CUSER_YYMMDDHHMISSTTTTTT, then insert into yangsy555 select * from CI_CUSER_YYMMDDHHMISSTTTTTT.

Create a table with a sequence column, using row_number() over() to number the rows for paginated queries:

create table yagnsytest2 as SELECT ROW_NUMBER() OVER() as id,* from yangsytest;
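Pagination then becomes a range filter on the generated id column; a minimal sketch, assuming a page size of 10:

select * from yagnsytest2 where id > 0 and id <= 10;   -- page 1
select * from yagnsytest2 where id > 10 and id <= 20;  -- page 2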

 

II. API

Spark SQL offers three API flavors:

SQL
the DataFrames API
the Datasets API

All three are executed by the same engine.

(I) Converting data to a DataFrame

1. (Semi-)structured data (HDFS files)

SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
// Parquet files are self-describing, so the schema is preserved.
DataFrame df = sqlContext.read().parquet("people.parquet");
// read().json() accepts either an RDD of JSON strings or a JSON file.
// Note it expects one JSON object per line, not a typical pretty-printed JSON file (see the sample below).
DataFrame df2 = sqlContext.read().json("/testDir/people.json");
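For read().json() to work, people.json must hold one self-contained JSON object per line; this sample mirrors the file shipped with the Spark examples:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

A pretty-printed JSON document spanning multiple lines will not parse correctly through this API.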

load() defaults to the parquet format; use format() to specify another one:

DataFrame df = sqlContext.read().load("examples/src/main/resources/users.parquet");  
DataFrame df = sqlContext.read().format("json").load("main/resources/people.json");  

The old API, now deprecated:

DataFrame df2 =sqlContext.jsonFile("/xxx.json");  
DataFrame df2 =sqlContext.parquetFile("/xxx.parquet"); 

2. RDD data

SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

a. Via a class, using reflection to infer the schema

Given: JavaRDD<Person> people

DataFrame df= sqlContext.createDataFrame(people, Person.class); 
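A minimal sketch of the Person class this relies on; reflection reads the JavaBean getters, and the fields name and age are assumptions:

public static class Person implements java.io.Serializable {
    private String name;
    private int age;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}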

b. Via an explicit schema applied to a Row RDD

Given: StructType schema = DataTypes.createStructType(fields);
and JavaRDD<Row> rowRDD, combine them with:

DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
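Putting the pieces together, a minimal sketch of the schema-based path, assuming sc is a JavaSparkContext and people.txt holds comma-separated name,age lines:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build the schema programmatically: two nullable columns, name and age.
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
StructType schema = DataTypes.createStructType(fields);

// Turn each input line into a Row that matches the schema.
JavaRDD<Row> rowRDD = sc.textFile("people.txt").map(
    line -> {
        String[] parts = line.split(",");
        return RowFactory.create(parts[0], Integer.parseInt(parts[1].trim()));
    });

DataFrame df = sqlContext.createDataFrame(rowRDD, schema);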

3. Hive data (tables whose schemas map onto HDFS files)

HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
DataFrame df = sqlContext.sql("select count(*) from wangke.wangke where ns_date=20161224");
// Spark SQL caches table metadata if so configured; refreshTable invalidates the cached entries.
sqlContext.refreshTable("my_table");
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");
sqlContext.sql("LOAD DATA LOCAL INPATH 'resources/kv1.txt' INTO TABLE src");
Row[] results = sqlContext.sql("FROM src SELECT key, value").collect();

4. Special forms

DataFrame df = sqlContext.sql("SELECT * FROM parquet.`main/resources/users.parquet`");  
//查询临时表people  
DataFrame teenagers = sqlContext.sql("SELECT name FROMpeople WHERE age >= 13 AND age <= 19")   
(II) Using a DataFrame

1. Display

df.show();  
df.printSchema(); 

2. Select and filter

df.select("name").show();  
df.select(df.col("name"), df.col("age").plus(1)).show();  
df.filter(df.col("age").gt(21)).show();  
df.groupBy("age").count().show(); 

3. Writing files

df.select("name", "favorite_color").write().save("namesAndFavColors.parquet");  
df.select("name", "age").write().format("parquet").save("namesAndAges.parquet");  
df.write().parquet("people.parquet");
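If the output path may already exist, set a SaveMode explicitly; a small sketch (Overwrite is one option, alongside Append, Ignore and ErrorIfExists):

import org.apache.spark.sql.SaveMode;

df.write().mode(SaveMode.Overwrite).parquet("people.parquet");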

4. Registering a temporary table

df.registerTempTable("people");  

After that it can be queried with SQL:

DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");

5. Saving as a Hive table

When working with a HiveContext, DataFrames can also be saved as persistent tables using the saveAsTable command.

Only a DataFrame created through a HiveContext can call saveAsTable to persist a Hive table.
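A minimal sketch, assuming df was created through a HiveContext; the table name here is hypothetical:

// persists both data and metadata to the Hive metastore
df.write().saveAsTable("people_persisted");

The table then outlives the application, unlike a registered temporary table.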

(III) Direct SQL operations
sqlContext.sql("create table xx.tmp like xx.xx");  
sqlContext.sql("insert into table xx.tmp partition(day=20160816) select * from xx.xx where day=20160816");  
sqlContext.sql("insert overwrite table xx.xx partition(day=20160816) select * from xx.tmp where day=20160816");  

Reposted from: https://www.cnblogs.com/xzjf/p/8857081.html
