In an information system, log data is automatically created and maintained by the operating system and server software to record a history of running status. Statistical analysis of log files can be used to detect and collect system anomalies. As the volume of data grows, a single host can no longer meet the storage and computing requirements. This dissertation addresses how to effectively improve the performance of log data analysis and how to present visualized results to the end user. After comparing different open-source data analysis implementations, it presents a data quality system built on a Hadoop cluster framework in a cloud computing environment.
Because it is scalable, economical, efficient, and reliable, the Hadoop distributed computing framework has been widely used in web search engines, log analysis, and data mining. This dissertation first introduces the business background of log statistics, including the analysis methods and the expected results. It then presents several distributed fuzzy matching methods for processing log data in a cloud computing environment. The fuzzy matching algorithms, including Levenshtein, Soundex, and Jaro-Winkler, calculate similarity distances between log records, and record linkage techniques then group the log data by similarity distance. Finally, a data visualization module is designed and implemented based on the open-source SpagoBI framework; it uses the OLAP (online analytical processing) features integrated in SpagoBI so that users can obtain graphical statistical results after the computation.
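As an illustration of the similarity-based grouping described above, the following is a minimal single-machine sketch in Python. It is not the distributed Hadoop implementation of this dissertation. It uses only the Levenshtein distance, and the function names, the normalization by the longer string, and the 0.8 threshold are assumptions chosen for the example.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings.
    Dividing by the longer length is an assumption of this sketch."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_records(records: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Greedy grouping: each record joins the first group whose
    representative (its first member) is similar enough, otherwise
    it starts a new group. The threshold is a hypothetical value."""
    groups: List[List[str]] = []
    for record in records:
        for group in groups:
            if similarity(record, group[0]) >= threshold:
                group.append(record)
                break
        else:
            groups.append([record])
    return groups
```

In a Map/Reduce setting, the pairwise comparison would typically be partitioned across nodes by some blocking key; the greedy loop here only shows the grouping idea on a small in-memory list.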
This dissertation describes the design and implementation of the data quality software system, including the overall system design and the implementation of each module.
Log analysis; Hadoop framework; Map/Reduce; Data visualization