Abstract:
Author Name Disambiguation (AD) is a type of record linkage applied to scholarly documents. Ambiguity arises when an author publishes under more than one version of their name, or when several authors share the same name. In either case it is difficult to distinguish between the authors of scholarly documents or to group documents by author. Machine learning techniques address this challenge by training a classifier to identify all the documents belonging to a given author and to distinguish them from the works of other authors sharing the same name. However, AD remains a major challenge because of the ever-increasing size of digital libraries and the lack of training examples that represent the whole domain. This study aims to provide a solution by using ORCID citations as a large and reliable source of training data. Several machine learning approaches were compared, including J48, DNN, Naive Bayes and Random Forest. The experimental results showed that the Random Forest classifier performed best among them, with almost 95% accuracy. In addition, the coauthors feature was the most important of the features considered, with an impact of 12.9% in eliminating ambiguity in author names.
Description:
CD, 79 pages, Informatics 2/2018, 30943