Article overview
Unsupervised record matching with noisy and incomplete data
Authors: Yves van Gennip; Blake Hunter; Anna Ma; Daniel Moyer; Ryan de Vera; Andrea L. Bertozzi
Date: 10 Apr 2017
Source: arXiv, 1704.2955
Abstract: We consider the problem of duplicate detection: given a large data set in
which each entry has multiple attributes, detect which distinct entries refer
to the same real-world entity. Our method consists of three main steps:
creating a similarity score between entries, grouping entries together into
'unique entities', and refining the groups. We compare various methods for
creating similarity scores, considering different combinations of string
matching, term frequency-inverse document frequency methods, and n-gram
techniques. In particular, we introduce a vectorized soft term
frequency-inverse document frequency method, with an optional refinement step.
We test our method on the Los Angeles Police Department Field Interview Card
data set, the Cora Citation Matching data set, and two sets of restaurant
review data. The results show that in certain parameter ranges soft term
frequency-inverse document frequency methods can outperform the standard term
frequency-inverse document frequency method; they also confirm that our method
for automatically determining the number of groups works well in many
cases and allows for accurate results in the absence of a priori knowledge of
the number of unique entities in the data set.
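
To make the first two steps in the abstract concrete (scoring similarity between entries, then grouping them), here is a minimal Python sketch. It is not the authors' vectorized soft TF-IDF method and it omits the refinement step: it uses standard TF-IDF over character n-grams with cosine similarity, then merges every pair of records whose score meets a fixed threshold. The function names, the smoothed IDF formula, the trigram size, and the threshold of 0.5 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))]


def tfidf_vectors(records, n=3):
    """Build L2-normalised TF-IDF vectors (as dicts) over character n-grams."""
    counts = [Counter(char_ngrams(r, n)) for r in records]
    # Document frequency: in how many records each n-gram appears.
    df = Counter(g for c in counts for g in c)
    total = len(records)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption of this sketch) keeps all weights positive.
        vec = {g: tf * (math.log((1 + total) / (1 + df[g])) + 1.0)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def group_records(records, threshold=0.5, n=3):
    """Group record indices whose pairwise similarity meets the threshold.

    Groups are the connected components of the thresholded similarity graph,
    found with a small union-find. The all-pairs loop is quadratic in the
    number of records, so this is only suitable for small inputs.
    """
    vectors = tfidf_vectors(records, n)
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `group_records(["John Smith", "john smith", "zzzz"])` places the first two records in one group and the third in its own group, because the first two are identical after lowercasing and share no trigrams with the third. Unlike the method described in the abstract, this sketch needs a hand-picked threshold and does not determine the number of groups automatically.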