hive2

Posted: 2019-07-15

4. Hive optimization
1) Parameters Hive prints when you run a SQL query:

In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number> 
  If the input size exceeds <number> bytes, another reducer is added.
  e.g. with <number> = 1024: input under 1 KB → one reducer;
  1 MB of input → roughly 1000 reducers
  
  set hive.exec.reducers.bytes.per.reducer=20000;
  select user_id,count(1) as order_cnt 
  from orders group by user_id limit 10;
--result: number of mappers: 1; number of reducers: 1009
  
  
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
  set hive.exec.reducers.max=10;
 -- number of mappers: 1; number of reducers: 10 
 
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
  set mapreduce.job.reduces=5;
  --number of mappers: 1; number of reducers: 5
  set mapreduce.job.reduces=15;
  --number of mappers: 1; number of reducers: 15
  Takes effect for the current session, or for the duration of the running task (script).
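Taken together, the three settings behave roughly like this simplified model (a sketch of Hive's sizing heuristic, not the actual planner code; the default cap of 1009 matches the reducer count seen above):

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=256 * 1024 * 1024,
                      max_reducers=1009, forced=None):
    # forced models mapreduce.job.reduces: when set, it wins outright.
    if forced is not None:
        return forced
    # Otherwise: one reducer per bytes_per_reducer of input,
    # capped by hive.exec.reducers.max.
    return min(max_reducers, max(1, math.ceil(input_bytes / bytes_per_reducer)))

# ~21 MB of input at 20000 bytes/reducer wants 1050 reducers,
# but the cap (1009) wins -- matching the 1009 reducers seen above.
print(estimate_reducers(21_000_000, bytes_per_reducer=20000))  # 1009
```

The function names and the 256 MB default are assumptions for illustration; check your Hive version's actual default for hive.exec.reducers.bytes.per.reducer.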

  2) WHERE conditions can make a GROUP BY redundant: filter rows before grouping
  The map and reduce phases run as a coordinated (synchronous) process.
  Synchronous: a phone call (the caller waits for the answer).
  Asynchronous: a text message (fire and forget).
  A reducer's final aggregation cannot start until every map task has finished,
  but the shuffle/copy phase overlaps with the map phase, which is why you can
  see progress like: map 60%, reduce 3%.
  3) Cases that force a single reducer
  a. No GROUP BY
  set mapreduce.job.reduces=5;
  select count(1) from orders where order_dow=0;
  --number of mappers: 1; number of reducers: 1
  b. ORDER BY (global sort)
  set mapreduce.job.reduces=5;
  select user_id,order_dow 
  from orders where order_dow=0
  order by user_id
  limit 10;
  -- number of mappers: 1; number of reducers: 1
  c. Cartesian product (cross product)
  tmp_d
1
2
3
4
5
select * from tmp_d 
join (select * from tmp_d)t 
where tmp_d.user_id=t.user_id; --this WHERE acts as the ON condition

A JOIN with no ON field relating the tables pairs every row with every row:
1   1
2   1
3   1
1   2
2   2
3   2
1   3
2   3
3   3
A legitimate use case: user × product candidate generation in recommendation.
From all products in the catalog, recall (match) a small candidate set (~1000)
the user is likely to want, then rank a top 10. Crossing users (e.g. a
mother-and-baby segment) with products gives a coarse-grained filter that
considers both user and product information at once.
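In Python the same cross can be sketched with itertools.product (names and data are made up for illustration):

```python
from itertools import product

# A JOIN with no ON condition pairs every user with every candidate
# product, just like the Cartesian product above.
users = ['u1', 'u2', 'u3']
candidates = ['p1', 'p2']

pairs = list(product(users, candidates))
print(len(pairs))  # 3 users x 2 candidates = 6 pairs
```

In a real recommender the candidate set is recalled down to ~1000 products first, precisely so this cross stays affordable.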

5) Map join
The /*+ MAPJOIN(x) */ hint names the small table, which is loaded into memory
on every mapper so the join needs no reduce phase:
select /*+ MAPJOIN(aisles) */ a.aisle as aisle,p.product_id as product_id 
from aisles a join product p 
on a.aisle_id=p.aisle_id limit 10;


The idea in pseudocode: the small table becomes an in-memory hash map, and the
big table is streamed through it on each mapper:

dict = hashMap {aisle_id : aisle}    # built from the small table (aisles)
for line in products:                # stream the big table
    ss = line.split('\t')
    aisle_id = ss[0]
    product_id = ss[1]
    aisle = dict[aisle_id]
    print('%s\t%s' % (aisle, product_id))
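A runnable version of that sketch, with toy tab-separated rows standing in for the aisles and products tables (the data is hypothetical):

```python
# Map-join simulation: the small table becomes an in-memory dict,
# the big table is streamed through it row by row.
aisles = ["1\tfresh fruits", "2\tpackaged cheese"]   # aisle_id \t aisle
products = ["1\t101", "2\t202", "1\t103"]            # aisle_id \t product_id

aisle_dict = {}
for line in aisles:
    aisle_id, aisle = line.split('\t')
    aisle_dict[aisle_id] = aisle

joined = []
for line in products:
    aisle_id, product_id = line.split('\t')
    joined.append('%s\t%s' % (aisle_dict[aisle_id], product_id))

print(joined)  # ['fresh fruits\t101', 'packaged cheese\t202', 'fresh fruits\t103']
```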
    
6) UNION ALL + DISTINCT == UNION
--runtime: 74.712 seconds, 2 jobs
select count(*) c
from (
select order_id,user_id,order_dow from orders where order_dow=0 union all
select order_id,user_id,order_dow from orders where order_dow=0 union all 
select order_id,user_id,order_dow from orders where order_dow=1
)t;

--runtime: 122.996 seconds, 3 jobs
select *
from(
select order_id,user_id,order_dow from orders where order_dow=0 
union 
select order_id,user_id,order_dow from orders where order_dow=0
union 
select order_id,user_id,order_dow from orders where order_dow=1)t;
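The semantic difference can be checked in a few lines of Python, with lists standing in for the sub-query results:

```python
# UNION ALL keeps duplicate rows; UNION deduplicates them, which is
# why Hive needs an extra job (the distinct step) for UNION.
dow0 = [(1, 'u1', 0), (2, 'u2', 0)]
dow1 = [(3, 'u3', 1)]

union_all = dow0 + dow0 + dow1            # 5 rows, duplicates kept
union = sorted(set(dow0 + dow0 + dow1))   # 3 distinct rows

print(len(union_all), len(union))  # 5 3
```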

7)
set hive.groupby.skewindata=true;
Splits one MapReduce job into two.
Typical skew: a junk key '-' (likewise '', -1, 0, null) with 100M rows all
lands on one reducer:

1 reducer handles the 60M rows of '-' (1% of the distinct keys)
29 reducers share the remaining 40M rows (99% of the keys)

With the flag on:
1. rows are first distributed randomly across reducers, each doing a partial
   aggregation (count);
2. one final reducer aggregates the partial results (e.g. summing 2M partial
   counts into 1 row).
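The two phases can be sketched in Python for a count-per-key query (a simplified model of what hive.groupby.skewindata=true does, assuming 30 reducers as above):

```python
import random
from collections import Counter

def skewed_group_count(rows, n_reducers=30):
    # Phase 1: scatter rows randomly so the skewed key is spread over
    # all reducers, each building partial counts.
    partials = [Counter() for _ in range(n_reducers)]
    for key in rows:
        partials[random.randrange(n_reducers)][key] += 1
    # Phase 2: a final job merges the partial counts per key.
    final = Counter()
    for p in partials:
        final.update(p)
    return final

rows = ['-'] * 1000 + ['a'] * 10   # '-' is the heavily skewed key
print(skewed_group_count(rows)['-'])  # 1000
```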
    
select add_to_cart_order,count(1) as cnt 
from order_products_prior 
group by add_to_cart_order
limit 10; 


select user_id,count(1) as cnt 
from order_products_prior 
group by user_id
limit 10; 





    
-- without set hive.groupby.skewindata=true;
--Launching Job 1 out of 1
-- 1m 41s

-- with set hive.groupby.skewindata=true;
--Launching Job 1 out of 2
-- 2m 50s

If skew is not actually making a reducer fail over and over, leave this
variable off (the extra job adds overhead, as the timings above show).
If one reducer really does receive far too much data, so the job keeps failing
or runs far too long, then set it.
Real-world case: a nightly scheduled job building a one-week report ran for 3
hours; the cleaned base table that should have landed at 3 a.m. came out at
7 a.m., with about 70 downstream tasks waiting on it.
  
  8) Number of MR jobs
  Consecutive joins on the same join key are merged into a single MR job:
 --Launching Job 1 out of 1
 select 
 ord.order_id order_id,
 tra.product_id product_id,
 pri.reordered reordered
from orders ord 
join train tra on ord.order_id=tra.order_id
join order_products_prior pri on ord.order_id=pri.order_id
limit 10;

--two MR jobs (the second join uses a different key, product_id)
 select 
 ord.order_id,
 tra.product_id,
 pro.aisle_id
from orders ord
join trains tra on ord.order_id=tra.order_id
join products pro on tra.product_id=pro.product_id
limit 10;

9) /*+ STREAMTABLE(a) */ (a is the big table)
Like the map join hint it goes in the SELECT; the difference is that it names
the large table (the one to stream), not the small one.
select /*+STREAMTABLE(pr)*/ ord.order_id,pr.product_id,pro.aisle_id
from orders ord
join order_products_prior pr on ord.order_id=pr.order_id
join products pro on pr.product_id=pro.product_id
limit 10;

10) LEFT OUTER JOIN
select od.user_id,
od.order_id,
tr.product_id
from
(select user_id,order_id,order_dow from orders limit 100)od
left outer join
(select order_id,product_id,reordered from train)tr 
on (od.order_id=tr.order_id and od.order_dow=0 and tr.reordered=1)
limit 30;

--a bare JOIN defaults to INNER JOIN
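The outer-join semantics can be modeled in Python (toy rows; note how predicates in the ON clause leave unmatched left rows in place with NULLs, instead of filtering them out the way a WHERE would):

```python
# Toy LEFT OUTER JOIN: every left (orders) row survives; right-side
# columns become None when the ON condition finds no match.
orders = [(1, 'u1', 0), (2, 'u2', 3)]   # (order_id, user_id, order_dow)
train = [(1, 'p9', 1), (2, 'p7', 0)]    # (order_id, product_id, reordered)

result = []
for oid, uid, dow in orders:
    matched = [pid for t_oid, pid, reordered in train
               if t_oid == oid and dow == 0 and reordered == 1]
    if matched:
        result.extend((uid, oid, pid) for pid in matched)
    else:
        result.append((uid, oid, None))  # kept, with NULL product_id

print(result)  # [('u1', 1, 'p9'), ('u2', 2, None)]
```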

11) set hive.exec.parallel=true
Lets stages of one query that do not depend on each other run concurrently,
instead of strictly one after the next (by default Hive runs a query's MR
jobs sequentially).


12) Handling the skewed key itself
1. If the junk key '-' carries no meaning, just drop it in the WHERE clause:
select age,count(1) from t where age<>'-' group by age;
2. If it must be kept, salt it: prefix the key with a random number
(1_-, 2_-, 3_-), aggregate per salted key, then strip the prefix and
aggregate once more.

How do you locate which specific keys are skewed?
Sample the table and inspect key frequencies:
SELECT COUNT(1) FROM (SELECT * FROM lxw1 TABLESAMPLE (200 ROWS)) x;
SELECT * FROM udata TABLESAMPLE (50 PERCENT);
select * from table_name where col=xxx order by rand() limit num; 
SELECT * FROM lxw1 TABLESAMPLE (30M);
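Once a sample is in hand, a frequency count over it exposes the skewed keys. A sketch with random.sample and Counter (synthetic data, seed fixed for repeatability):

```python
import random
from collections import Counter

random.seed(42)
# Synthetic skewed column: the junk key '-' dominates the table.
table = ['-'] * 90_000 + [str(i % 100) for i in range(10_000)]

sample = random.sample(table, 200)    # like TABLESAMPLE (200 ROWS)
top = Counter(sample).most_common(3)  # most frequent keys in the sample
print(top[0][0])  # the skewed key '-' surfaces at the top
```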

Long-tail data
 

 

Original: https://www.cnblogs.com/hejunhong/p/11186393.html
