Document Type

Article

Publication Date

6-2005

Publication Source

Proceedings of the 2005 International Conference on Data Mining, DMIN 2005

Abstract

Clustering is well suited to Web mining because it automatically organizes Web pages into categories, each containing pages with similar content. However, clustering lacks general methods for automatically determining the number of categories, or clusters, and no existing method is suitable for Web page clustering in particular. To address this problem, we identify a constant factor that characterizes the Web domain and, based on it, propose a new method for automatically determining the number of clusters in Web page data sets. In all of our experiments, the average inter-cluster similarity measure reached a constant value of 1.7 whenever clustering of Web pages produced the best results. We therefore use this constant as the stopping factor in our clustering process: individual Web pages are arranged into clusters, those clusters are arranged into larger clusters, and so on until the average inter-cluster similarity approaches the constant. Combining the method described in this paper with our new Bidirectional Hierarchical Clustering algorithm, reported elsewhere, we have developed a clustering system suitable for mining the Web.
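The stopping idea in the abstract can be illustrated with a minimal sketch. It is not the paper's Bidirectional Hierarchical Clustering algorithm, and the abstract does not define the similarity measure behind the reported constant of 1.7. The sketch therefore assumes a plain dot-product similarity, average-link merging, and a caller-supplied `target` value. All function names are hypothetical. It builds the hierarchy bottom-up and keeps the level whose average inter-cluster similarity is closest to the target.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Vector = Sequence[float]
Cluster = List[int]  # indices into the list of page vectors


def dot_similarity(a: Vector, b: Vector) -> float:
    # Assumption: a plain dot product stands in for the paper's measure.
    return sum(x * y for x, y in zip(a, b))


def avg_link(c1: Cluster, c2: Cluster, vectors: Sequence[Vector],
             sim: Callable[[Vector, Vector], float]) -> float:
    # Mean pairwise similarity between the members of two clusters.
    total = sum(sim(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def avg_inter_cluster(clusters: List[Cluster], vectors: Sequence[Vector],
                      sim: Callable[[Vector, Vector], float]) -> float:
    # Mean of avg_link over every pair of distinct clusters.
    pairs = list(combinations(clusters, 2))
    return sum(avg_link(a, b, vectors, sim) for a, b in pairs) / len(pairs)


def choose_clustering(vectors: Sequence[Vector], target: float,
                      sim: Callable[[Vector, Vector], float] = dot_similarity
                      ) -> List[Cluster]:
    # Requires at least two vectors. Starts with one cluster per page,
    # repeatedly merges the most similar pair, and remembers the level
    # whose average inter-cluster similarity is closest to `target`.
    clusters: List[Cluster] = [[i] for i in range(len(vectors))]
    best = [list(c) for c in clusters]
    best_gap = abs(avg_inter_cluster(clusters, vectors, sim) - target)
    while len(clusters) > 2:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_link(clusters[p[0]], clusters[p[1]], vectors, sim),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        gap = abs(avg_inter_cluster(clusters, vectors, sim) - target)
        if gap < best_gap:
            best, best_gap = [list(c) for c in clusters], gap
    return best


# Toy data: two pages near the first axis, two near the second.
pages = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
result = choose_clustering(pages, target=0.1)
```

The target of 0.1 here is chosen only to suit the toy data and the assumed dot-product measure. The constant 1.7 reported in the abstract applies to the paper's own measure and should not be expected to carry over to this one.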

Document Version

Published Version

Peer Reviewed

yes

Keywords

Web Mining, Clustering, Classification, Information Retrieval, Knowledge Discovery

eCommons Citation

Yao, Zhongmei, "Automatically Discovering the Number of Clusters in Web Page Datasets" (2005). Computer Science Faculty Publications. 14.
https://ecommons.udayton.edu/cps_fac_pub/14

Included in

Databases and Information Systems Commons, Graphics and Human Computer Interfaces Commons, Information Security Commons, Numerical Analysis and Scientific Computing Commons, OS and Networks Commons, Other Computer Sciences Commons, Programming Languages and Compilers Commons, Software Engineering Commons, Systems Architecture Commons, Theory and Algorithms Commons

Computer Science Faculty Publications

Automatically Discovering the Number of Clusters in Web Page Datasets
