Hive Partitioned Tables and Bucketed Tables

Posted: 2021-05-04 23:23:27

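Partitioned tables

Hive can translate queries into MapReduce (MR) jobs. When a table holds a lot of data, a job that has to read every file under the table's directory becomes very slow.
In practice we often need to process only the previous day's data, so we can put each day's data in its own folder and run the computation against that folder alone.
A Hive partitioned table works the same way: each partition (for example, each day) is stored in its own subdirectory, and a query that filters on the partition column reads only the matching subdirectories, which narrows the data scanned and speeds up the query.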
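As a rough illustration of how the directory layout narrows a scan, the query below filters on the partition column of the score table that is created in the next section. Because month is a partition column, Hive only needs to read the files under the month=201806 directory. The selected column list is just an example.

```sql
-- Filter on the partition column so that only one partition directory is read.
select s_id, c_id, s_score
from score
where month = '201806';
```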

Syntax for creating a partitioned table

create table score(s_id string,c_id string,s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Create a table with multiple partition columns

create table score2(s_id string,c_id string,s_score int)
partitioned by (year string,month string,day string)
row format delimited fields terminated by '\t';

Load data into a partitioned table

load data local inpath '/bigdata/logs/score.csv' into table score partition(month='201806');
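For the multi-level table score2, a load has to name a value for every partition column. The sketch below reuses the same CSV file; the partition values 2018, 06 and 01 are made up for illustration.

```sql
-- Every partition column of score2 must be given a value.
-- The values 2018 / 06 / 01 are illustrative only.
load data local inpath '/bigdata/logs/score.csv'
into table score2 partition(year='2018', month='06', day='01');
```

The data ends up in a nested directory of the form year=2018/month=06/day=01 under the table's location.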

View partitions

show partitions score;

Add a partition

alter table score add partition(month='201805');

Add several partitions at once

alter table score add partition(month='201804') partition(month='201803');

Drop a partition

alter table score drop partition(month='201806');

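Bucketed tables

Bucketing divides data at a finer granularity than partitioning; a plain Hive table or a partitioned table can be further split into buckets.
How it works: Hive hashes the value of a chosen column and takes the result modulo the number of buckets to decide which bucket a record goes to. Records whose hash values give the same remainder go to the same bucket, and each bucket is stored as one file.
For example, bucketing by the name column into 3 buckets means taking the hash of each name value modulo 3 and placing the record according to the result.
Purpose: speeds up certain queries, such as sampling and joins on the bucketing column.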
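To make the hash-and-modulo rule concrete, the query below is a small sketch that computes a bucket number for each row by hand, using Hive's built-in hash() and pmod() functions. It assumes the user_demo table from the next section exists and has been loaded, and the alias bucket_no is just an illustrative name. The file a row actually lands in can differ, since newer Hive versions may use a different hash function for bucketing, so treat the output as an illustration of the principle only.

```sql
-- Illustrative only: compute hash(id) modulo 4 for each row.
-- pmod keeps the result non-negative even when the hash is negative.
select id, name, pmod(hash(id), 4) as bucket_no
from user_demo;
```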

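Create a bucketed table

-- Needed on Hive 1.x; from Hive 2.0 on, bucketing is always enforced and this property no longer exists
set hive.enforce.bucketing=true;
-- use 4 reducers, one per bucket
set mapreduce.job.reduces=4;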

-- create the bucketed table
create table user_buckets_demo(id int,name string)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t';

-- create the plain (non-bucketed) table
create table user_demo(id int, name string)
row format delimited fields terminated by '\t';

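Prepare the data file user_bucket.txt (the two fields on each line are separated by a tab, matching the '\t' delimiter in the table definitions)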

cd /bigdata/logs/
vi user_bucket.txt


1 anzhulababy1
2 AngleBaby2
3 xiaoxuanfeng1
4 heixuanfeng1
5 jingmaoshu1
6 dongjingshu1
7 dianwanxiaozi1
8 aiguozhe1

Load the data into the bucketed table user_buckets_demo

load data local inpath '/bigdata/logs/user_bucket.txt'
overwrite into table user_buckets_demo;

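From Hive 3.x on, data can be loaded directly into a bucketed table without going through a plain table and an insert-select. On versions before Hive 3.x, loading takes two steps:
Load the data into the plain table
load data local inpath '/bigdata/logs/user_bucket.txt' overwrite into table user_demo;
Insert into the bucketed table by selecting from the plain table
insert into table user_buckets_demo select * from user_demo;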
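One way to check the result, sketched here with Hive's TABLESAMPLE clause, is to read a single bucket of user_buckets_demo at a time. The bucket numbers start at 1; which rows show up in which bucket depends on the hash function your Hive version uses.

```sql
-- Read only the first of the 4 buckets, sampling on the bucketing column.
select * from user_buckets_demo tablesample(bucket 1 out of 4 on id);
```

Repeating the query for buckets 2 through 4 should cover the remaining rows, with each row appearing in exactly one bucket.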

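Partitioned and bucketed table examples

Partitioned table
Create a partitioned table with the fields (name, sex, age), partitioned by age

Table creation statement (a partition column cannot have the same name as a regular column)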

-- the columns match the three space-separated fields in the data files: name, sex, age
create table if not exists student1(
name string,
sex string,
age int)
partitioned by (age1 int)
row format delimited fields terminated by ' ';

Prepare the data files t_student.txt and t_student1.txt

vi t_student.txt

zhangsan 男 20
lisi 男 20
wangwu 男 20

vi t_student1.txt

lilei 男 21
Lucy 女 21
hanmeimei 女 21

Load data into the partitioned table (the partition column is age1, not age)

load data local inpath '/bigdata/logs/t_student.txt' into table student1 partition (age1=20);

load data local inpath '/bigdata/logs/t_student1.txt' into table student1 partition (age1=21);
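After loading, the partitions can be inspected and queried. The statements below are a minimal sketch against the student1 table; filtering on the partition column age1 limits the read to a single partition directory.

```sql
-- List the partitions that now exist for student1.
show partitions student1;

-- Read only the partition whose age1 value is 20.
select name, sex, age
from student1
where age1 = 20;
```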

Bucketed table
Create a bucketed table with the fields (name, sex, age), bucketed by sex

Table creation statement

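-- Needed on Hive 1.x; from Hive 2.0 on, bucketing is always enforced and this property no longer exists
set hive.enforce.bucketing=true;
set mapreduce.job.reduces=2;
-- the columns match the three space-separated fields in the data file: name, sex, age
create table if not exists student2(
name string,
sex string,
age int)
clustered by(sex) into 2 buckets
row format delimited fields terminated by ' ';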

Prepare the data file t_student2.txt

vi t_student2.txt

zhangsan 男 31
lisi 男 32
wangwu 男 33
lilei 男 35

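Load data into the bucketed table (a direct load works on Hive 3.x and later; on earlier versions use the two-step load through a plain table described in the bucketed tables section)

load data local inpath '/bigdata/logs/t_student2.txt' overwrite into table student2;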
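To see how the rows were distributed, one option is to count rows per sex value inside a single bucket with TABLESAMPLE. This is only a sketch; the alias row_count is an illustrative name. Since the bucket is chosen from the hash of sex, all rows that share a sex value should appear in the same bucket.

```sql
-- Count the rows in the first of the 2 buckets, grouped by the bucketing column.
select sex, count(*) as row_count
from student2 tablesample(bucket 1 out of 2 on sex)
group by sex;
```

Running the same query with bucket 2 out of 2 shows the other bucket, which may be empty when every row carries the same sex value.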

Original article: https://www.cnblogs.com/tenic/p/14730455.html