Chapter 1 Introduction

1.1 Research Background and Significance
1.1.1 Research Background
Logs are a very broad concept in computer systems, and log analysis is the work of statistically processing and classifying the large numbers of log files produced by different business systems. The log sources involved in the current work mainly include the log files produced by the data integration system during data integration and the log files generated by the back-end programs of Web systems. Daily work involves analyzing and integrating the log data produced by many different systems in routine operation: log records must be grouped by keyword using fuzzy matching algorithms, and then, by type, valuable information such as timestamps, alarm levels, and description text must be extracted, so that problems in the business systems can be analyzed and reports produced from the results.
At present, the log data analysis process lacks automated analysis techniques that match actual business needs. The proportion of manual processing is high, and no efficient distributed computing technology is applied to the large volume of log data, which makes data integration work inefficient. The main problem this thesis addresses is therefore to use open-source data analysis and visualization technologies to develop a log data analysis software system that automates routine data analysis and processing, presents analysis results visually in charts and other forms, and, taking generality as its starting point, allows the research results to be adopted by more log analysis and processing work in the future.

1.1.2 Research Significance
The log analysis system designed in this thesis is mainly used to analyze the log files of the current data integration system. In that system, the various data processing programs produce large numbers of log files during operation, which contain all kinds of exception information about the source data as well as data quality validation information. Because the data volume is large, the log data must be preprocessed and classified before any specific type of analysis can be performed. This thesis applies current general-purpose fuzzy query techniques to log analysis: a matching computation is performed on the log records, the records are grouped according to the computed results, and the algorithms are ported from the traditional single-host mode of operation to a distributed log analysis platform implemented with parallel computing technology.
The significance of this research is that it automates routine log data analysis and processing and uses the Hadoop distributed platform to improve data processing efficiency. In the analysis and integration of large volumes of log data from different sources, it combines actual business needs to achieve automatic grouping of log data and to improve log file processing efficiency; grouping results can be presented visually in charts and other forms; and, taking generality as its starting point, the research results can be adopted by more analysis and processing work in the future.
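The grouping approach described above can be illustrated with a minimal single-host sketch (not the thesis's actual implementation): log records are compared by normalized Levenshtein edit distance, and each record joins the first group whose representative is sufficiently similar. The threshold value and the greedy first-match strategy are illustrative assumptions.

```python
# Illustrative sketch: grouping log records by edit-distance similarity.
# This is the single-host form of the approach that the thesis later ports
# to a distributed (MapReduce) platform.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a similarity score in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def group_logs(records, threshold=0.8):
    """Greedily assign each record to the first group whose representative
    is at least `threshold` similar; otherwise start a new group."""
    groups = []  # list of (representative, [members]) pairs
    for rec in records:
        for rep, members in groups:
            if similarity(rec, rep) >= threshold:
                members.append(rec)
                break
        else:
            groups.append((rec, [rec]))
    return groups
```

In a MapReduce setting, the same similarity computation would run in the map phase, with the reduce phase merging records that share a group representative.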

DESIGN AND IMPLEMENTATION OF A LOG ANALYSIS SOFTWARE SYSTEM BASED ON OPEN-SOURCE SOFTWARE (Abstract)

  • Abstract

In information systems, log data are automatically created and maintained by the operating system and server software, preserving a history of running status. Statistical analysis of the log files can be used to examine and collect the discrepancies of a system. As data sizes grow, a single host can no longer meet the storage and computing requirements. How to improve the performance of log data analysis effectively and provide visualized results to the end user is the research requirement of this dissertation. Through a comparison of different open-source data analysis implementations, this dissertation presents a data quality system under the Hadoop cluster framework based on cloud computing.
Being scalable, economical, efficient, and reliable, the Hadoop distributed computing framework has been widely used in web search engines, log analysis, and data mining. This dissertation first introduces the business background of log statistics, including the analysis method and the expected results. It then presents several distributed fuzzy matching methods for processing log data in a cloud computing environment: the fuzzy matching algorithms, including Levenshtein, Soundex, and Jaro-Winkler, compute a similarity distance for each log record, and record linkage technology then groups the log data by similarity distance. Finally, a data visualization module is designed and implemented on the SpagoBI open-source framework; it uses the OLAP (on-line analytical processing) features integrated in SpagoBI, so that users can obtain graphical statistical results after the computation.
This dissertation describes the design and implementation of the data quality software system, including the system design and the implementation of each module.

  • Keywords
Log analysis; Hadoop framework; Map/Reduce; Data visualization