Apache Griffin Compilation and Installation, Part 2: Simulating Data

2021-01-15 17:06 · ZooM

Summary: The previous post finished compiling and installing Griffin and got the web UI working. While testing with simulated data I ran into quite a few problems, including creating the Hive tables and YARN errors.

  Apache Griffin, the open-source data quality solution for big data: compiling, deploying, and installing from source

  The previous post covered compiling, installing, and running Griffin, but did not yet simulate any data for testing. The simulation steps below follow the post: Apache Griffin 5.0 compilation, installation and usage (including a fix for dependencies that fail to download).


  Step 1: Create the database and tables

# Log in to the server and create the database
hive -e "create database griffin_demo"
# Open the Hive CLI against the new database
hive --database griffin_demo

Next, create the tables. The CREATE TABLE statements taken from the post referenced above kept failing for me:

hive> CREATE EXTERNAL TABLE `demo_src`(
    >   `id` bigint,
    >   `age` int,
    >   `desc` string)
    > PARTITIONED BY (
    >   `dt` string,
    >   `hour` string)
    > ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY '|'
    > LOCATION
    >   'hdfs://cdh6:8020/griffin/data/batch/demo_src';

hive> CREATE EXTERNAL TABLE `demo_tgt`(
    >   `id` bigint,
    >   `age` int,
    >   `desc` string)
    > PARTITIONED BY (
    >   `dt` string,
    >   `hour` string)
    > ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY '|'
    > LOCATION
    >   'hdfs://cdh6:8020/griffin/data/batch/demo_tgt';

Hive threw an error when the LOCATION clause was specified:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.hadoop.ipc.RemoteException Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error

I have not found the root cause yet, so I simply dropped the LOCATION clause. (The message suggests the URI points at the standby NameNode of an HDFS HA pair, so using the HA nameservice instead of a fixed host:port would likely avoid it; see the sketch below.) Be sure to keep the field delimiter specification, otherwise the simulated data cannot be loaded later.
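A minimal sketch of the same table with an HA-aware LOCATION, assuming the cluster's nameservice is called nameservice1 (a name commonly used on CDH; check dfs.nameservices in hdfs-site.xml for the real value):

-- Sketch only: replace nameservice1 with the value of dfs.nameservices.
-- With a nameservice URI the client resolves the active NameNode itself
-- instead of being pinned to a single host that may be in standby.
CREATE EXTERNAL TABLE `demo_src`(
  `id` bigint,
  `age` int,
  `desc` string)
PARTITIONED BY (
  `dt` string,
  `hour` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
LOCATION
  'hdfs://nameservice1/griffin/data/batch/demo_src';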

  Step 2: Download the sample data

cd $GRIFFIN_HOME/data
wget http://griffin.apache.org/data/batch/gen_demo_data.sh
wget http://griffin.apache.org/data/batch/gen_delta_src.sh
wget http://griffin.apache.org/data/batch/demo_basic
wget http://griffin.apache.org/data/batch/delta_src
wget http://griffin.apache.org/data/batch/delta_tgt
wget http://griffin.apache.org/data/batch/insert-data.hql.template
# If demo_src and demo_tgt were already created successfully above, this script is not needed
#wget http://griffin.apache.org/data/batch/create-table.hql
chmod 755 *.sh
./gen_demo_data.sh

On Windows you can simply open the links in a browser and save the files. Put all of the sample data files into one directory.


Open insert-data.hql.template: by default it does not specify a database name, so add it yourself, and make sure the data file paths point at the files you just generated.

[root@iot-200 data]# cat insert-data.hql.template
LOAD DATA LOCAL INPATH '/disk1/apps/griffin/data/demo_src' INTO TABLE griffin_demo.demo_src PARTITION (PARTITION_DATE);
LOAD DATA LOCAL INPATH '/disk1/apps/griffin/data/demo_tgt' INTO TABLE griffin_demo.demo_tgt PARTITION (PARTITION_DATE);

  Step 3: Create gen-hive-data.sh to load the data

#!/bin/bash

#create table. The tables were already created manually in step 1, so this is commented out
#hive -f create-table.hql
echo "create table done"

#current hour
sudo ./gen_demo_data.sh
cur_date=`date +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"

#last hour
sudo ./gen_demo_data.sh
cur_date=`date -d '1 hour ago' +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"

#next hours
set +e
while true
do
  sudo ./gen_demo_data.sh
  cur_date=`date +%Y%m%d%H`
  next_date=`date -d "+1hour" '+%Y%m%d%H'`
  dt=${next_date:0:8}
  hour=${next_date:8:2}
  partition_date="dt='$dt',hour='$hour'"
  sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
  hive -f insert-data.hql
  src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
  tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
  hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
  hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
  hadoop fs -touchz ${src_done_path}
  hadoop fs -touchz ${tgt_done_path}
  echo "insert data [$partition_date] done"
  sleep 3600
done
set -e
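
Because the LOCATION clause was dropped when creating the tables, the loaded data itself lives under the Hive warehouse directory; the /griffin/data/batch/... paths that the script creates on HDFS only hold the _DONE marker files it touches for each partition. A quick sanity check after the first run might look like this (database name and paths as used above):

# List the Hive partitions that were registered by the load
hive -e "show partitions griffin_demo.demo_src"
# List the partition directories and _DONE markers the script created on HDFS
hadoop fs -ls -R /griffin/data/batch/demo_src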

  Step 4: Run the script to load the data

chmod +x gen-hive-data.sh
./gen-hive-data.sh
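
Note that the script never exits: after loading the current and previous hour it keeps inserting one new partition every hour (sleep 3600), so it is usually more convenient to run it in the background, for example:

nohup ./gen-hive-data.sh > gen-hive-data.log 2>&1 &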

  Step 5: Verify the data

# Connect to the database
[root@iot-200 data]# hive --database griffin_demo
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false
OK
Time taken: 2.134 seconds

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> select * from demo_src limit 3;
OK
0       1       1       20210113        16
0       2       2       20210113        16
0       3       3       20210113        16
Time taken: 3.522 seconds, Fetched: 3 row(s)
hive>

That completes loading the simulated data.

Since the platform here is CDH, a YARN error came up while inserting into Hive:

2021-01-13 10:26:49,048 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed for container_e20_1610429316887_0003_01_000001
java.io.IOException: Application application_1610429316887_0003 initialization failed (exitCode=255) with output: main : command provided 0
main : run as user is hdfs
main : requested yarn user is hdfs
Requested user hdfs is not whitelisted and has id 372,which is below the minimum allowed 1000

Cause: YARN refuses to run containers for users whose UID is below 1000 (the min.user.id limit of the LinuxContainerExecutor); here the hdfs user has UID 372.
Fix: lower the minimum allowed user ID to 0.
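
A sketch of the relevant settings, assuming the cluster uses the LinuxContainerExecutor (on CDH they are managed through Cloudera Manager under the YARN configuration; search for min.user.id, then restart the NodeManagers after changing them):

# container-executor.cfg
# Either lower the minimum UID allowed to launch containers ...
min.user.id=0
# ... or whitelist specific system users without lowering the limit
allowed.system.users=hdfs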



  A new problem:

  Although the simulated data has been imported, selecting a schema in the Griffin web UI produces errors in the backend:

1. org.apache.thrift.transport.TTransportException: null

2. Caused by: org.apache.thrift.transport.TTransportException: Invalid status -128

The service also logs org.apache.thrift.transport.TTransportException: null at startup, and the metadata cannot be retrieved when selecting a schema. The Hive Metastore service log shows:

Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Invalid status -128
  at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
  at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:268)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Invalid status -128
  at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
  at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:184)
  at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
  at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
  at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
  ... 4 more
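
The "Invalid status -128" in handleSaslStartMessage usually means a plain (non-SASL) Thrift client is talking to a Metastore that expects a SASL/Kerberos handshake: -128 (0x80) is the first byte of an ordinary Thrift binary-protocol message, which the SASL server does not recognize. If that is what is happening here, the Griffin service would need to authenticate with Kerberos against the Metastore (or SASL would have to be disabled on the Metastore side) before schema browsing works. Whether SASL is enabled can be checked on the Metastore host; the path below assumes the usual CDH client configuration directory:

# Check whether the Hive Metastore requires SASL/Kerberos
grep -A1 "hive.metastore.sasl.enabled" /etc/hive/conf/hive-site.xml
grep -A1 "hive.metastore.kerberos.principal" /etc/hive/conf/hive-site.xml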