Document representations for classification of short web-page descriptions

Authors

  • Miloš Radovanović, Faculty of Science, Department of Mathematics and Informatics, Novi Sad
  • Mirjana Ivanović, Faculty of Science, Department of Mathematics and Informatics, Novi Sad

DOI:

https://doi.org/10.2298/YJOR0801123R

Keywords:

text categorization, document representation, machine learning

Abstract

Motivated by the application of text categorization to the classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, which are short Web-page descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open Directory Web-page ontology, and the classifiers are trained to automatically determine the topics that may be relevant to a previously unseen Web page. Different transformations of the input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance as measured by the classical metrics: accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation for each classifier, but rather on describing the effect of each individual transformation on classification, together with their mutual relationships.
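
The transformations and metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulas, so the definitions below are common textbook forms and should be read as assumptions: logtf as log(1 + tf), idf as log(N / df), normalization as scaling to unit Euclidean length, and the F-measure in its usual F-beta form (F1 for beta = 1, F2 for beta = 2). All function names are hypothetical, and stemming is left out.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Bag-of-words: one term -> raw count mapping per tokenized document."""
    return [Counter(doc) for doc in docs]


def logtf(tf):
    """Dampen raw counts; assumed form log(1 + tf)."""
    return {term: math.log(1 + count) for term, count in tf.items()}


def idf_weights(counts):
    """Inverse document frequency; assumed form log(N / df)."""
    n_docs = len(counts)
    # Iterating a Counter yields each distinct term once per document.
    doc_freq = Counter(term for c in counts for term in c)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def apply_idf(vec, idf):
    """Multiply every term weight by that term's idf."""
    return {term: weight * idf[term] for term, weight in vec.items()}


def normalize(vec):
    """Scale a document vector to unit Euclidean length (assumed L2 norm)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return dict(vec)
    return {term: w / norm for term, w in vec.items()}


def f_beta(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two toy "Web-page descriptions", already tokenized.
docs = [["web", "search", "web"], ["search", "engine"]]
counts = term_frequencies(docs)
idf = idf_weights(counts)
vectors = [normalize(apply_idf(logtf(tf), idf)) for tf in counts]
```

Because each transformation is a separate function, any subset can be switched on or off independently, which mirrors the kind of per-transformation comparison the abstract describes. With beta = 2 the F-measure weights recall more heavily than precision, so a classifier with higher precision than recall scores lower on F2 than on F1.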

Published

2008-03-01

Section

Research Articles