Document representations for classification of short web-page descriptions
DOI:
https://doi.org/10.2298/YJOR0801123R
Keywords:
text categorization, document representation, machine learning
Abstract
Motivated by the application of text categorization to the classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, which are short Web-page descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open Directory Web-page ontology, and the classifiers are trained to automatically determine the topics that may be relevant to a previously unseen Web page. Different transformations of the input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have statistically significant effects, either improving or degrading, on classification performance as measured by classical metrics: accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation for each classifier, but rather on describing the effect of every individual transformation on classification, together with the relationships between the transformations.
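The transformations and metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the common textbook definitions, namely logtf as log(1 + tf), idf as log(N / df), and normalization as scaling each document vector to unit Euclidean length. The function names `transform` and `f_beta` are hypothetical, and stemming and dimensionality reduction are left out.

```python
import math
from collections import Counter


def transform(docs, use_logtf=True, use_idf=True, normalize=True):
    """Turn tokenized documents into weighted bag-of-words vectors.

    Assumed definitions (not taken from the paper):
      logtf:         log(1 + tf)
      idf:           log(N / df)
      normalization: divide by the Euclidean norm of the vector
    """
    n_docs = len(docs)
    counts = [Counter(tokens) for tokens in docs]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    vectors = []
    for c in counts:
        vec = {}
        for term, tf in c.items():
            weight = math.log(1 + tf) if use_logtf else float(tf)
            if use_idf:
                weight *= math.log(n_docs / df[term])
            vec[term] = weight
        if normalize:
            norm = math.sqrt(sum(w * w for w in vec.values()))
            if norm > 0:
                vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


def f_beta(precision, recall, beta=1.0):
    """F-measure; beta=1 gives F1, beta=2 gives F2 (recall weighted higher)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    docs = [["web", "search", "web"], ["search", "engine"]]
    for vec in transform(docs):
        print(vec)
    print(f_beta(0.5, 1.0, beta=1), f_beta(0.5, 1.0, beta=2))
```

Each transformation is a separate switch so that its effect can be examined in isolation, in the spirit of the study. Under the assumed idf definition, a term that occurs in every document receives weight zero.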
License
Copyright (c) YUJOR
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.