Starting namenodes on [localhost] Starting datanodes Starting secondary namenodes [sherry-HP-ENVY-Notebook-13-ab0XX]
1
$ hive
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] Hive Session ID = e95e48ee-b6a7-482b-a68a-45a70597bc9a
Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-3.1.1.jar!/hive-log4j2.properties Async: true Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. hive>
1 Toy Story (1995) Animation|Children's|Comedy 2 Jumanji (1995) Adventure|Children's|Fantasy 3 Grumpier Old Men (1995) Comedy|Romance 4 Waiting to Exhale (1995) Comedy|Drama 5 Father of the Bride Part II (1995) Comedy 6 Heat (1995) Action|Crime|Thriller 7 Sabrina (1995) Comedy|Romance 8 Tom and Huck (1995) Adventure|Children's 9 Sudden Death (1995) Action 10 GoldenEye (1995) Action|Adventure|Thriller
Top 10 viewed
1 2 3 4 5 6 7 8
SELECT movies.MovieID, movies.Title, COUNT(DISTINCT ratings.UserID) as views FROM movies JOIN ratings ON (movies.MovieID = ratings.MovieID) GROUPBY movies.MovieID, movies.Title ORDERBY views DESC LIMIT10;
Execution completed successfully MapredLocal task succeeded Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2019-03-26 11:21:26,647 Stage-2 map = 0%, reduce = 0% 2019-03-26 11:21:32,812 Stage-2 map = 100%, reduce = 0% 2019-03-26 11:21:34,828 Stage-2 map = 100%, reduce = 100% Ended Job = job_local1582945428_0001 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2019-03-26 11:21:36,162 Stage-3 map = 100%, reduce = 100% Ended Job = job_local901668662_0002 MapReduce Jobs Launched: Stage-Stage-2: HDFS Read: 43211608 HDFS Write: 0 SUCCESS Stage-Stage-3: HDFS Read: 43211608 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 2858 American Beauty (1999) 3428 260 Star Wars: Episode IV - A New Hope (1977) 2991 1196 Star Wars: Episode V - The Empire Strikes Back (1980) 2990 1210 Star Wars: Episode VI - Return of the Jedi (1983) 2883 480 Jurassic Park (1993) 2672 2028 Saving Private Ryan (1998) 2653 589 Terminator 2: Judgment Day (1991) 2649 2571 Matrix 2590 1270 Back to the Future (1985) 2583 593 Silence of the Lambs 2578 Time taken: 20.208 seconds, Fetched: 10 row(s)
-- Reuslt from mysql +---------+-------------------------------------------------------+-------+ | MovieID | Title | views | +---------+-------------------------------------------------------+-------+ | 2858 | American Beauty (1999) | 3428 | | 260 | Star Wars: Episode IV - A New Hope (1977) | 2991 | | 1196 | Star Wars: Episode V - The Empire Strikes Back (1980) | 2990 | | 1210 | Star Wars: Episode VI - Return of the Jedi (1983) | 2883 | | 480 | Jurassic Park (1993) | 2672 | | 2028 | Saving Private Ryan (1998) | 2653 | | 589 | Terminator 2: Judgment Day (1991) | 2649 | | 2571 | Matrix | 2590 | | 1270 | Back to the Future (1985) | 2583 | | 593 | Silence of the Lambs | 2578 | +---------+-------------------------------------------------------+-------+ 10 rows in set (4min8.42 sec)
--Result from Hive 2858 American Beauty (1999) 3428 260 Star Wars: Episode IV - A New Hope (1977) 2991 1196 Star Wars: Episode V - The Empire Strikes Back (1980) 2990 1210 Star Wars: Episode VI - Returnof the Jedi (1983) 2883 480 Jurassic Park (1993) 2672 2028 Saving Private Ryan (1998) 2653 589 Terminator 2: Judgment Day (1991) 2649 2571 Matrix 2590 1270 Back to the Future (1985) 2583 593 Silence of the Lambs 2578 Time taken: 20.208seconds, Fetched: 10row(s)
创建分区表
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
--创建新的分区表 CREATETABLE users1(userid INT, Age INT, Occupation INT, ZIP INT) PARTITIONED BY (gender STRING);
--设置动态分区 set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;
SELECT movies_bucketed.Movieid, movies_bucketed.Title, COUNT(DISTINCT ratings_bucketed.Userid) AS views FROM movies_bucketed JOIN ratings_bucketed ON (movies_bucketed.Movieid = ratings_bucketed.Movieid) GROUPBY movies_bucketed.Movieid, movies_bucketed.Title ORDERBY views DESC LIMIT10;
Time taken: 17.914 seconds, Fetched: 10 row(s)
1
SELECT * FROM movies_bucketed TABLESAMPLE(BUCKET3OUTOF32);
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1312 Female Perversions (1996) Drama 2336 Elizabeth (1998) Drama 1504 Hollow Reed (1996) Drama 480 Jurassic Park (1993) Action|Adventure|Sci-Fi 1633 Ulee's Gold (1997) Drama 961 Little Lord Fauntleroy (1936) Drama 865 Small Faces (1995) Drama 3329 Year My Voice Broke The (1987) 866 Bound (1996) Crime|Drama|Romance|Thriller 3010 Rosetta (1999) Drama 3811 Breaker Morant (1980) Drama|War 3076 Irma la Douce (1963) Comedy 2756 Wanted: Dead or Alive (1987) Action 2212 Man Who Knew Too Much The (1934) ……
Time taken: 0.989 seconds, Fetched: 109 row(s)
Hive中的Join策略
Join策略类型
方法
优点
缺点
Shuffle Join
通过对数据进行重新排序与map reduce实现连接操作
不需要考虑数据规模
速度慢,消耗的资源多
Broadcast Join
小的数据表会被加载到所有节点的内存中,mapper会扫描大的数据表并进行连接
非常快,只需要对大的数据表进行一次扫描
较小的数据表的规模必须足够小,能够被内存容纳
Sort-Merge-Bucket Join
mapper利用连接键的相同同分布进行高效的连接
对于任何规模的数据表都非常快
数据必须预先进行排序和分桶
img
135014_ekR7_568818.png
135056_wVBt_568818.png
参考文献
[1] Thusoo A , Sarma J S , Jain N , et al. Hive - A Warehousing Solution Over a Map-Reduce Framework[J]. Proceedings of the VLDB Endowment, 2009, 2(2):1626-1629.
[2] Thusoo A , Sarma J S , Jain N , et al. Hive – A Petabyte Scale Data Warehouse Using Hadoop[J]. 2010.