HIVE

2024-07-14 21:16| 来源: 网络整理| 查看: 265

文章目录 hive常用交互命令`-e`执行sql`-f`执行脚本中sql语句hive cli命令行窗口操作hdfs查看hive中输入的所有历史命令库创建库查看库使用库修改库删除库表查看表创建表语法管理表和外部表示例修改表删除表分区表概念操作分区表和数据产生关联方式一：上传数据后修复方式二：上传数据后添加分区方式三：创建文件夹后load数据到分区 DML数据操作数据导入向表中导入数据(load data)语法查询插入数据(insert... select)查询创建表并加载数据（as select）创建表时location指定加载数据路径(location)import数据到指定hive表中数据导出insert导出hadoop -get命令导出本地hive -e命令导出到本地hive export导出到hdfs 查询列别名运算符常用函数where语句示例比较运算符like和rlike逻辑运算符分组group by语句having语句join语句排序全局排序（order by）按照别名排序多列排序区内排序（sort by ）分区排序（distribute by）cluster by

hive常用交互命令 -e执行sql [linux01@test hive$] bin/hive -e "select * from tab_01;" -f执行脚本中sql语句编写sql脚本 [linux01@test hive$] touch /data/test/hive-f-test.sql

在脚本中编写sql。 2. 执行sql脚本

[linux01@test hive$] bin/hive -f /data/test/hive-f-test.sql 执行sql脚本并将结果写入另一个文件 [linux01@test hive$] bin/hive -f /data/test/hive-f-test.sql > /data/test/data/hive-f-test-result.txt hive cli命令行窗口操作hdfs hive> dfs -ls /user/hive/warehouse/; 查看hive中输入的所有历史命令 [linux01@test $] cat ~/.hivehistory 库创建库 hive> create database if not exists db01; 查看库查看所有库 hive> show databases; 查看数据库信息 hive> desc database db01; 查看数据库详细信息 hive> desc database extended db01; 使用库 hive> use db01; 修改库

可以使用alter database命令为某个数据库的DBPROPERTIES设置键-值对属性值，用于描述数据库的属性信息。数据库的其他元数据信息无法更改，包括数据库名和数据库所在目录位置location。

hive> alter database hive set dbproperties('createtime'='20220101'); 删除库删除空数据库 hive> drop database db01; 删除不存在数据库 hive> drop database if exists db01; 删除不为空的数据库 hive> drop database db01 cascade; 表查看表查看所有表 hive> show tables; 查看表结构 hive> show create table tab_01; 查看表类型 hive> desc formatted tab_01; 创建表语法 CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path]

常用简化：

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT col_comment], ...)] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [ROW FORMAT row_format] [LOCATION hdfs_path]

字段说明：

CREATE TABLE：创建一个指定名称的表，若相同名字的表已经存在，则抛出异常，用户可使用IF NOT EXISTS选项忽略这个异常。EXTERNAL：关键字创建一个外部表，在建表的同时指定一个指向实际数据的路径LOCATION，hive创建内部表时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不会对数据的位置做任何改变。在删除表时，内部表的元数据和数据会被一起删除；而外部表只删除元数据，不删除数据。COMMENT：为表和列添加注释。PARTITIONED BY：创建分区表。CLUSTERED BY：创建分桶表。SORTED BY：桶内排序。ROW FORMAT：DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEROPERTIES (property_name=property_value, property_name=property_value, ...] 用户在建表的时候可以自定义SerDe（Serialize/Deserialize，序列化和反序列化）或者使用自带的SerDe，若没有指定ROW FORMAT 或者ROW FORMAT DELIMITED，将会使用自带的SerDe，在建表的时候，用户还需要为表指定列，用户在指定表的列同时也会指定自定义的SerDe。hive通过SERDE确定表的具体的列的数据。STORED AS：指定存储文件的类型，常见类型有：SEQUENCEFILE（二进制序列文件）、TEXTFILE（文本）、RCFILE（列式存储格式文件）.若文本数据是纯文本，可以使用STORED AS TEXTFILE，若需要压缩存储，可以使用STORED AS SEQUENCEFILE。LOCATION：指定表在HDFS上的存储位置。 10.LIKE：允许用户复制现有的表结构，但是不复制数据。管理表和外部表

默认创建的表都是管理表，也称为内部表，hive会控制数据的声明周期，不适合和其他工具共享数据。默认情况下会将这些表的数据存储在hive.metastore.warehouse.dir配置项定义的目录的子目录下，如/user/hive/warehouse，当删除一个管理表时，hive也会删除这个表中的数据，即hdfs location的数据也会一并删除。外部表使用EXTERNAL关键字进行创建。若创建外部表，仅记录数据所在的路径，不会对数据的位置做任何改变。在删除表时，内部表的元数据和数据会被一起删除；而外部表只删除元数据，不删除数据。

示例创建分区表 hive> create table if not exists tab_01 (colume01 string) partitioned by (colume02 string) row format delimited fields terminated by '\t'; 创建外部表 hive> create external table if not exists db01.tab_01( id int, name string ) row format delimited fields terminated by '\t';

装载数据进表中

hive> load data local inpath '/xxx/xxx/xxx.txt' into table db01.tab_01;

删除数据表

hive> drop table db01.tab_01;

此时location中的数据还是存在，即删除外部表后，元数据删除，但数据不删除。

修改表修改内部表为外部表 hive> alter table tab_01 set tblpropertites('EXTERNAL'='TRUE'); 修改外部表为内部表 hive> alter table tab_01 set tblpropertites('EXTERNAL'='FALSE'); 重命名表 hive> alter table tab_01 rename to tab_01_new; 增加列信息 hive> alter table tab_01 add columns(column02 string); 更新列 hive> alter table tab_01 change column column02 column02_new string; hive> alter table tab_01 change column column02_new column02_new_new int; 替换列不会修改存储在hdfs中的数据，只是改元数据的列而已。若hdfs中存储的是string类型，若列replace列为int后，则查不了对应的数据。 hive> alter table tab_01 replace columns (column03 int); 删除表删除表 hive> drop table tab_01; 清空表只能清空管理表，不清楚外部表。 hive> truncate table tab_01; 分区表概念

分区表实际上是对应一个HDFS文件系统上的独立文件夹，该文件夹是该分区所有的数据文件。hive中的分区就是分目录，把一个大的数据集切割成多个小的数据集，在查询时可以通过WHERE选定指定的分区查询对应的数据。

操作创建分区表 hive> create table if not exists tab_01 (column01 string) partitioned by (partition_column01 string) row format delimited fields terminated by '\t'; 查看分区信息 hive> show partitions tab_01; 查看分区表结构 hive> desc formatted tab_01; 插入分区数据 hive> insert into table tab_01 partition(partition_column01='xxxx') values ('yyyy'); load数据到分区 hive> load data local inpath '/data/xxx/xx/yy.txt' into table tab_01 partition(column02='xxxxxx'); 查看分区数据 hive> select * from tab_01 where partition_column01='xxxx'; 增加单个分区 hive> alter table tab_01 add partition(partition_column01='yyyy'); 增加多个分区 hive> alter table tab_01 add partition(partition_column01='xxx') partition(partition_column01='yyy'); 删除单个分区 hive> alter table tab_01 drop partition(partition_column01='yyyy'); 删除多个分区 hive> alter table tab_01 drop partition(partition_column01='xxx'), partition(partition_column01='yyy'); 创建二级分区 hive> create table tab_01(column01 string, column02 int ) partitioned by (partition_column01 string, partition_column02 string) row format delimited fields terminated by '\t'; 加载数据到二级分区表 hive> load data local inpath '/data/xxx/yy.txt' into table tab_01 partition(partition_column01='xxxx', partition_column02='yyyy'); 查询二级分区表数据 hive> select * from tab_01 where partition_column01='xxxx' and partition_column02='yyyy'; 分区表和数据产生关联方式一：上传数据后修复上传数据 [linux01@test $] hdfs dfs -mkdir -p /user/hive/warehouse/tab_01/month=202205/day=7; [linux01@test $] hdfs dfs -put /data/t.txt /user/hive/warehouse/tab_01/month=202205/day=7; 查询数据 hive> select * from tab_01 where month='202205' and day='7';

查询不到数据，因为该表只是有实际数据，但是无元数据。 3. 执行修复命令

hive> msck repair table tab_01; 再次查询数据 hive> select * from tab_01 where month='202205' and day='7';

查询到数据。

方式二：上传数据后添加分区上传数据 [linux01@test $] hdfs dfs -mkdir -p /user/hive/warehouse/tab_01/month=202205/day=8; [linux01@test $] hdfs dfs -put /data/t.txt /user/hive/warehouse/tab_01/month=202205/day=8; 增加分区 hive> alter table tab_01 add partition(month=202205,day=8); 查询数据 hive> select * from tab_01 where month='202205' and day='8'; 方式三：创建文件夹后load数据到分区创建目录 [linux01@test $] hdfs dfs -mkdir -p /user/hive/warehouse/tab_01/month=202205/day=9; 上传数据 hive> load data local inpath '/data/t.txt' into table tab_01 partition(month='202205',day='9'); 查询数据 hive> select * from tab_01 where month='202205' and day='9'; DML数据操作数据导入向表中导入数据(load data) 语法 load data [local] inpath '/xxx/xxx/xx.txt' [overwrite] into table tab_01 [partition(part_col1=val1,...)];

其中:

load data: 表示加载数据。local：表示从本地加载数据到hive表，否则从hdfs加载数据到hive表中。inpath：表示加载数据的路径。overwrite：表示覆盖原表中的数据，若不加该关键字，则表示追加数据。into table：表示加载到目标表。tab_01：表示加载到的目标表。partition：表示加载到表中的哪个分区。查询插入数据(insert… select) 创建分区表 hive> create table tab_01(id int, name string) > partitioned by (month string) > row format delimited fields terminated by '\t'; 基本插入数据 hive> insert into table tab_01 partition (month='202205') values (1, 'xiaoming'); 基本插入模式（单张表查询插入） hive> insert overwrite table tab_01 partition(month='202206') > select id, name from tab_01 where month='202205'; 多插入模式（多张表查询插入） hive> from tab_01 > insert overwrite table tab_01 partition(month='202207') > select id, name from tab_01 where month='202205' > insert overwrite table tab_01 partition(month='202208') > select id, name from tab_01 where month='202205'; 查询创建表并加载数据（as select） hive> create table if not exists tab_02 > as select id from tab_01; 创建表时location指定加载数据路径(location) 创建表并指定hdfs路径 hive> create table if not exists tab_02( > id int, name string > ) > row format delimited fields terminated by '\t' > location '/user/hive/warehouse/tab_02'; 上传数据到hdfs路径中 [linux01@test $] hdfs dfs -put /xxx/xxx/xx.txt /user/hive/warehouse/tab_02 查询数据 hive> select * from tab_02; import数据到指定hive表中

需要先将数据export后才能import进hive表中。

hive> import table tab_01 partition(month='202205') > from > '/user/hive/warehouse/export/tab_01'; 数据导出 insert导出将查询结果导出到本地 hive> insert overwrite local directory > '/data/hive/export/tab_01' > select * from tab_01; 将查询的结果格式化导出到本地 hive> insert overwrite local directory > '/data/hive/export/tab_01' > row format delimited fields terminated by '\t' > select * from tab_01; 将查询的结果导出到hdfs上 hive> insert overwrite directory > '/user/user01/export/tab_01' > row format delimited fields terminated by '\t' > select * from tab_01; hadoop -get命令导出本地 [linux01@test $] hdfs dfs -get /user/hive/warehouse/tab_01/month=202205/day=8 /data/test/export/t.txt hive -e命令导出到本地 [linux01@test hive$] bin/hive -e 'select * from db01.tab_01;' > /data/test/export/t.txt hive export导出到hdfs hive> export table tab_01 to '/data/test/export/tab_01' 查询列别名 hive> select name cn_name from tab_01; hive> select name as cn_name from tab_01; 运算符运算符说明A+BA加BA-BA减BA*BA乘以BA/BA除以BA%BA对B取余A&BA和B按位取与A|BA和B按位取或A^BA和B按位取异或，相同即为0，不同即为1~AA按位取反 hive> select num + 100 from tab_01; 常用函数计数（count） hive> select count(*) cnt from tab_01; 最大值（max） hive> select max(num) max_num from tab_01; 最小值（min） hive> select min(num) min_num from tab_01; 总和（sum） hive> select sum(num) sum_num from tab_01; 平均值（avg） hive> select avg(num) avg_num from tab_01; limit语句 hive> select * from tab_01 limit 10; where语句示例 hive> select * from tab_01 where id

【本文地址】

HIVE

HIVE

今日新闻

推荐新闻