1. 采集数据_项目一

发布于 2022年 08月 09日 00:37

说明

  1. 用户点击页面后数据存储到a.log文件中。(本项目省去了这一步,数据已经在a.log中了)
  2. 使用java代码将a.log文件中的数据,写入project.log中。
  3. 使用flume采集日志,监控project.log文件内容的变化,将新增的用户的数据写出到hdfs上。

a.log中的现成数据

点击查看代码
120.191.181.178 - - 2018-02-18 20:24:39 "POST https://www.taobao.com/item/b HTTP/1.1" 203 69172 https://www.taobao.com/register UCBrowser Webkit X3android 8.0 海南 20.02 110.20 36
149.74.183.133 - - 2018-09-24 19:38:17 "GET https://www.taobao.com/register HTTP/1.0" 300 72815 https://www.taobao.com/item/a Windows Internet Explorer Tridentwindows 广西 22.48 108.19 39
58.9.92.122 - - 2018-08-30 11:28:15 "GET https://www.taobao.com/list/ HTTP/1.0" 203 17119 https://www.taobao.com/category/a OPera blinkwindows 重庆 29.59 106.54 35
77.56.72.210 - - 2018-05-13 18:11:22 "POST https://www.taobao.com/category/b HTTP/1.0" 201 17843 https://www.taobao.com/list/ Firefox ios 吉林 43.54 125.19 57
217.147.196.74 - - 2018-08-22 18:06:01 "GET https://www.taobao.com/category/c HTTP/1.0" 501 95033 https://www.taobao.com/category/b Mozilla Firefox Geckowindows 山东 36.40 117.00 28
37.146.124.65 - - 2018-05-08 02:12:24 "POST https://www.taobao.com/category/d HTTP/1.1" 203 47329 https://www.taobao.com/item/a Safari webkitwindows 江西 28.40 115.55 25
167.108.24.171 - - 2018-08-20 12:19:01 "POST https://www.taobao.com/recommand HTTP/1.1" 302 80056 https://www.taobao.com/category/d Google Chrome Chromium/BlinkMac OS X 广东 23.08 113.14 19
94.69.229.202 - - 2018-07-27 13:46:37 "GET https://www.taobao.com/recommand HTTP/1.1" 501 63116 https://www.taobao.com/recommand Windows Internet Explorer Tridentwindows 陕西 34.17 108.57 58

日志采集脚本的编写

  • 监控文件内容的变化,进而采集到数据
[root@node1 dataCollect]# cat dataCollect.conf
dataCollect.sources = s1
dataCollect.channels = c1
dataCollect.sinks = k1
dataCollect.sources.s1.type = exec
dataCollect.sources.s1.command = tail -F /opt/project/dataCollect/project.log
dataCollect.channels.c1.type = memory
dataCollect.channels.c1.capacity = 20000
dataCollect.channels.c1.transactionCapacity = 10000
dataCollect.channels.c1.byteCapacity = 1048576000
# sink info
dataCollect.sinks.k1.type = hdfs
#指定采集的数据存储到HDFS的哪个目录下
dataCollect.sinks.k1.hdfs.path = hdfs://node1:9000/project/%Y%m%d/
#上传文件的前缀
dataCollect.sinks.k1.hdfs.filePrefix = project-
#上传文件的后缀
dataCollect.sinks.k1.hdfs.fileSuffix = .log
##是否按照时间滚动文件夹
dataCollect.sinks.k1.hdfs.round = true
##多少时间单位创建一个新的文件夹
dataCollect.sinks.k1.hdfs.roundValue = 24
##重新定义时间单位
dataCollect.sinks.k1.hdfs.roundUnit = hour
##是否使用本地时间戳
dataCollect.sinks.k1.hdfs.useLocalTimeStamp = true
##积攒多少个Event才flush到HDFS一次
dataCollect.sinks.k1.hdfs.batchSize = 5000
##设置文件类型,可支持压缩
dataCollect.sinks.k1.hdfs.fileType = DataStream
##多久生成一个新的文件每隔6小时
dataCollect.sinks.k1.hdfs.rollInterval = 21600
##设置每个文件的滚动大小128M
dataCollect.sinks.k1.hdfs.rollSize = 134217700
##文件的滚动与Event数量无关
dataCollect.sinks.k1.hdfs.rollCount = 0
##最小冗余数 副本数
dataCollect.sinks.k1.hdfs.minBlockReplicas = 1
# Bind the source and sink to the channel
dataCollect.sources.s1.channels = c1
dataCollect.sinks.k1.channel = c1

使用java代码将用户行为数据写入project.log中

package com.sxuek;
import java.io.*;
import java.util.Random;
public class SimPro {
public static void main(String[] args) throws IOException, InterruptedException {
/*
每隔一段时间去读取/root/project/dataCollect/a.log文件中的一行或者多行数据
然后将数据追加到project.log文件
*/
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("/root/project/dataCollect/a.log"), "UTF-8"));
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/root/project/dataCollect/project.log", true), "utf-8"));
/*
读取文件数据 模拟用户点击
*/
/*
随机生成两个数字,第一个数字时间间隔,第二个数字每一次点击的次数
*/
String line = null;
Random random = new Random();
while ((line = br.readLine()) != null) {
int time = random.nextInt(5000);
int count = 1 + random.nextInt(10000);
Thread.sleep(time); // 代表时间间隔
System.out.println("间隔了"+time+"时间,有"+count+"个用户点击了网站,产生了用户行为日志数据");
for (int i = 0; i < count; i++) {
bw.write(line);
bw.newLine();
bw.flush();
line = br.readLine();
}
}
}
}

打jar包上传,运行

开始采集

flume-ng agent -n dataCollect -f dataCollect.conf -Dflume.root.logger=INFO,console
java -jar untitled.jar SimPro.java

推荐文章