American Journal of Information Science and Computer Engineering
American Journal of Information Science and Computer Engineering, Vol.7, No.3, Sep. 2021, Pub. Date: Sep. 21, 2021
On the Development of Machine Learning Algorithms for Information Extraction of Structured Academic Data from Unstructured Web Documents
Pages: 36-51
Authors
[01] Joshua Babatunde Agbogun, Department of Mathematical Sciences, Kogi State University, Anyigba, Nigeria.
[02] Vincent Andrew Akpan, Department of Biomedical Technology, The Federal University of Technology, Akure, Nigeria.
Abstract
This paper proposes a machine learning approach for extracting structured academic data from unstructured web documents. It critically examines the current challenges of information extraction and the state of the art in structured data extraction. The approach is presented in simplified form using a comprehensive flowchart. The machine learning information extraction scheme was validated using Kogi State University (KSU), Anyigba, Kogi State, Nigeria. The paper presents preliminary studies of KSU together with its organogram. It also highlights the feasibility and realization of the machine learning algorithms for extracting structured academic data from unstructured web documents, and lists the goals accomplished.
Keywords
Artificial Neural Networks, Information Extraction, Machine Learning, Structured Academic Data, Unstructured Web Documents