University of Louisville
A Cloud-Based Framework for Web Usage Mining
Institution
University of Louisville
Faculty Advisor/Mentor
Carlos Rojas; Olfa Nasraoui
Abstract
Web server logs can be analyzed to reveal web usage profiles, page similarities, and other information obtained through web mining. Before mining, however, the logs must be preprocessed so that the data mining algorithms receive clean, well-formatted data. Preprocessing tasks include filtering out requests from spider bots and search engine crawlers, and sessionization, which groups separate requests by the same user into a single session. Unfortunately, preprocessing can be time-consuming, especially when it is done on a single computer. The objective of this project is to use distributed computing to significantly decrease the time needed to preprocess a web log into a specially formatted file that serves as input to existing web mining algorithm implementations. To accomplish this, we use Hadoop, an open-source implementation of the MapReduce programming model, which currently runs on a four-node cluster of regular (recycled) workstations. Using a cluster for preprocessing is an essential step toward data mining and analysis that generate real-time results. The lessons learned here may also translate into performance gains for other computational tasks. Although our current implementation is still small in scale, the same framework can be extended to a much larger scale, leading to genuinely cloud-based web usage mining.
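The two preprocessing tasks named above, crawler filtering and sessionization, fit the map and reduce phases naturally. The following is a minimal local sketch of that idea in Python, not the project's Hadoop implementation. It rests on several assumptions that do not come from the abstract: the logs are in Apache Combined Log Format, a user is identified by IP address, crawlers are detected by keywords in the user-agent string, and a session ends after 30 minutes of inactivity. The names map_request, reduce_sessions, run_local, BOT_MARKERS, and SESSION_TIMEOUT are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: Apache Combined Log Format (the abstract does not name a format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Assumptions: illustrative crawler keywords and a 30-minute inactivity timeout.
BOT_MARKERS = ("bot", "crawler", "spider")
SESSION_TIMEOUT = timedelta(minutes=30)


def map_request(line):
    """Map step: drop malformed and crawler requests, emit (ip, (time, url))."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return  # skip lines that do not parse
    agent = match.group("agent").lower()
    if any(marker in agent for marker in BOT_MARKERS):
        return  # skip requests that look like they come from a crawler
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    yield match.group("ip"), (timestamp, match.group("url"))


def reduce_sessions(ip, requests):
    """Reduce step: order one user's requests by time and split into sessions."""
    sessions = []
    last_timestamp = None
    for timestamp, url in sorted(requests):
        # Start a new session on the first request or after a long gap.
        if last_timestamp is None or timestamp - last_timestamp > SESSION_TIMEOUT:
            sessions.append([])
        sessions[-1].append(url)
        last_timestamp = timestamp
    return sessions


def run_local(lines):
    """Simulate the shuffle in memory: group map output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for ip, value in map_request(line):
            grouped[ip].append(value)
    return {ip: reduce_sessions(ip, requests) for ip, requests in grouped.items()}
```

On a cluster, the framework would perform the grouping that run_local simulates in memory, so the map and reduce functions could run in parallel over separate portions of the log. Identifying users by IP address alone is a simplification; it merges users behind a shared proxy and splits users whose address changes.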