ERIC Number: ED659087
Record Type: Non-Journal
Publication Date: 2024
Pages: 112
Abstractor: As Provided
ISBN: 979-8-3835-3080-1
ISSN: N/A
EISSN: N/A
A Scalable Parallel Processing Design for the Data Washing Machine: An Unsupervised Entity Resolution System
Nicholas Kofi Akortia Hagan
ProQuest LLC, Ph.D. Dissertation, University of Arkansas at Little Rock
Entity Resolution (ER) has been one of the bedrocks in the creation of information systems by ensuring that ambiguous entities are identified and resolved by linking. One common design approach of traditional ER systems is to run in single-threaded mode, which makes the system prone to out-of-memory errors when processing larger datasets. The Data Washing Machine (DWM), a proof-of-concept of an unsupervised cluster ER system, is no exception to this common design bottleneck. The original prototype design of the DWM requires shared memory tables and dictionaries of tokens, and its single-threaded nature makes it not scalable, hence not viable for real-world applications. Distributed and parallel programming frameworks such as Hadoop MapReduce (MR) and Apache Spark's Resilient Distributed Datasets (RDD) are great fits for scaling ER systems, since the comparison of equivalent pairs is independent and can occur in parallel. This dissertation aims at designing and developing a Distributed DWM by adopting the parallel and distributed capabilities of Hadoop MR and RDD. An initial prototype (HadoopDWM) was developed using Hadoop MR and was then refactored into SparkDWM using RDD. Experimental results show that HadoopDWM and SparkDWM produce the same results as the legacy DWM when using optimal starting parameters. A scalability test conducted using 203 million records confirms that HadoopDWM and SparkDWM are scalable, with total execution times of 7 and 3 hours, respectively. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by Telephone 1-800-521-0600. Web page: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml.]
Descriptors: Information Systems, Multivariate Analysis, Information Management, Computer System Design, Information Technology
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: National Science Foundation (NSF), Office of Integrative Activities (OIA)
Authoring Institution: N/A
Grant or Contract Numbers: 1946391