This article is about 5,936 characters; estimated reading time 19 minutes.
This article belongs to my column "1000 Questions to Master the Big Data Technology Stack". The column is original work; please credit the source when quoting, and point out any mistakes or omissions in the comments. Thanks!
The column's table of contents and references, as well as my separate deep-dive posts on SEQUENCEFILE, ORC, PARQUET, and AVRO, are linked from the original post.
Hive's CREATE TABLE statement supports the following storage formats:

| Storage format | Description |
|---|---|
| STORED AS TEXTFILE | Stored as plain text. TEXTFILE is the default file format unless the configuration parameter hive.default.fileformat is set otherwise. Use the DELIMITED clause to read delimited files. Enable escaping for the delimiter characters with the ESCAPED BY clause (e.g. ESCAPED BY '\'); escaping is needed if you want to work with data that can contain those delimiter characters. A custom NULL format can also be specified with the NULL DEFINED AS clause (the default is '\N'). As of Hive 4.0, all BINARY columns in the table are assumed to be base64-encoded; to read the data as raw bytes, set TBLPROPERTIES ("hive.serialization.decode.binary.as.base64"="false"). |
| STORED AS SEQUENCEFILE | Stored as a compressed Sequence File. |
| STORED AS ORC | Stored in the ORC file format. Supports ACID transactions and the cost-based optimizer (CBO). Stores column-level metadata. |
| STORED AS PARQUET | Stored as Parquet format for the Parquet columnar storage format in Hive 0.13.0 and later; in Hive 0.10, 0.11, or 0.12 the syntax is ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... |
| STORED AS AVRO | Stored as Avro format in Hive 0.14.0 and later. |
| STORED AS RCFILE | Stored in the Record Columnar File format. |
| STORED AS JSONFILE | Stored in the JSON file format in Hive 4.0.0 and later. |
| STORED BY | Stored by a non-native table format. Creates or links to a non-native table, for example a table backed by HBase, Druid, or Accumulo. |
| INPUTFORMAT and OUTPUTFORMAT | In file_format, specify the names of the corresponding InputFormat and OutputFormat classes as string literals, e.g. 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For LZO compression, the values to use are INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat". |
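As a sketch of the INPUTFORMAT/OUTPUTFORMAT syntax from the last row above (the table name lzo_log and its single column are hypothetical; the LZO class named here additionally requires the hadoop-lzo library on Hive's classpath):

```sql
-- Hypothetical table reading LZO-compressed text via explicit
-- InputFormat/OutputFormat classes instead of a STORED AS keyword.
CREATE TABLE lzo_log (line string)
STORED AS
  INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
  OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
```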
The test data file log.txt is 18.1 M in size.
First, create a TEXTFILE table (the default format) and load the data:

```sql
create table log_text (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as TEXTFILE;

load data local inpath '/home/hadoop/log.txt' into table log_text;
```
```
dfs -du -h /user/hive/warehouse/log_text;
+------------------------------------------------+--+
|                   DFS Output                   |
+------------------------------------------------+--+
| 18.1 M  /user/hive/warehouse/log_text/log.txt  |
+------------------------------------------------+--+
```
Next, create a PARQUET table and copy the data into it:

```sql
create table log_parquet (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as PARQUET;

insert into table log_parquet select * from log_text;
```
```
dfs -du -h /user/hive/warehouse/log_parquet;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 13.1 M  /user/hive/warehouse/log_parquet/000000_0  |
+----------------------------------------------------+--+
```
Then an ORC table:

```sql
create table log_orc (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as ORC;

insert into table log_orc select * from log_text;
```
```
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+
```
In terms of on-disk size, the compression effectiveness ranks ORC > PARQUET > TEXTFILE: the 18.1 M of raw text shrinks to 13.1 M as Parquet and 2.8 M as ORC.
```
select count(*) from log_text;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (16.99 seconds)
```

```
select count(*) from log_parquet;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (17.994 seconds)
```

```
select count(*) from log_orc;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (15.943 seconds)
```
In terms of query speed, the three formats are close on this count(*) test, ranking ORC > TEXTFILE > PARQUET (15.9 s, 17.0 s, and 18.0 s respectively).
The advantage of compression is that it minimizes the disk space required and reduces disk and network I/O.
ORC supports three compression settings: ZLIB, SNAPPY, and NONE (no compression). ORC uses ZLIB by default.
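For completeness, the default ZLIB codec can also be requested explicitly, as a sketch (the table name log_orc_zlib is hypothetical; the columns mirror log_text above):

```sql
-- Hypothetical table that makes ORC's default ZLIB codec explicit.
create table log_orc_zlib (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties ("orc.compress"="ZLIB");
```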
First, an ORC table with compression disabled (note that the documented table property key is lowercase, orc.compress):

```sql
create table log_orc_none (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties ("orc.compress"="NONE");

insert into table log_orc_none select * from log_text;
```
```
dfs -du -h /user/hive/warehouse/log_orc_none;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 7.7 M  /user/hive/warehouse/log_orc_none/000000_0  |
+----------------------------------------------------+--+
```
Next, an ORC table compressed with SNAPPY:

```sql
create table log_orc_snappy (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties ("orc.compress"="SNAPPY");

insert into table log_orc_snappy select * from log_text;
```
```
dfs -du -h /user/hive/warehouse/log_orc_snappy;
+------------------------------------------------------+--+
|                      DFS Output                      |
+------------------------------------------------------+--+
| 3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0  |
+------------------------------------------------------+--+
```
For comparison, the log_orc table created earlier with the default ZLIB compression:

```
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+
```