With the rapid development and growing popularity of World Wide Web (WWW) technology, web designers increasingly need to discover and analyze useful information from the WWW. The web server log file gives a detailed account of who accesses a web site, which pages were requested and in what order, and how long each page was viewed. A log file is therefore a rich source of usage information.
However, log files are not only unstructured but also distorted in many cases. In particular, they are seriously distorted when users request web pages through a proxy cache server, because requests from different users can then appear to come from a single host. Preprocessing is therefore necessary before meaningful information can be discovered and analyzed. The data preparation process consists of data cleaning, user identification, session identification, path completion, and formatting.
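As a rough illustration of the first three preparation steps, the sketch below parses log lines, removes requests for embedded resources, and groups the remaining requests into users and sessions. It is not the algorithm proposed in this thesis. It assumes the Combined Log Format, treats each (host, user agent) pair as one user, and starts a new session after 30 minutes of inactivity. These heuristics, the ignored file suffixes, and all function names are illustrative assumptions.

```python
import re
from datetime import datetime, timedelta

# Assumed input: Combined Log Format; referer and user agent are optional.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)
# Assumed cleaning rule: embedded resources are not page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
# Assumed inactivity threshold for session identification.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse(line):
    """Return a dict for one log line, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # %b expects English month abbreviations (default C locale).
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def clean(records):
    """Data cleaning: drop unparsed lines, failed requests, and embedded resources."""
    return [
        r for r in records
        if r is not None
        and int(r["status"]) < 400
        and not r["url"].lower().endswith(IGNORED_SUFFIXES)
    ]


def sessions_by_user(records):
    """User and session identification with simple heuristics.

    A user is approximated by (host, user agent), so two browsers behind
    the same proxy address are separated only if their agents differ.
    """
    users = {}
    for r in sorted(records, key=lambda rec: rec["time"]):
        sessions = users.setdefault((r["host"], r["agent"]), [])
        if sessions and r["time"] - sessions[-1][-1]["time"] <= SESSION_TIMEOUT:
            sessions[-1].append(r)
        else:
            sessions.append([r])
    return users
```

Calling `sessions_by_user(clean([parse(line) for line in lines]))` returns a dictionary that maps each approximate user to a list of sessions. Users who share both a proxy address and a user agent remain merged under this heuristic, which is the distortion described above.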
This thesis first describes the concept and definition of web mining and explains its individual components. It then develops an algorithm to identify users who are routed through proxy cache servers. Finally, the proposed algorithm is evaluated experimentally.
The experiments were conducted using groups of two or three people. The results show a restoration ratio of 78% and an error ratio of 4.1% on average, indicating that the proposed algorithm is a reasonable tool for identifying users routed through proxy cache servers.