Big Data integration: Record linkage¶
Record linkage: blocking + pairwise matching + clustering
- Blocking: efficiently creates small blocks of similar records, which ensures scalability
- Pairwise matching: compares all record pairs in a block and computes their similarity
- Clustering: groups sets of records into entities, which ensures semantics
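The three stages can be sketched end to end in a few lines. This is only a minimal illustration, not a reference implementation: the blocking key (first three letters of the name), the token-set Jaccard similarity, the 0.5 threshold, and the sample records are all assumptions chosen for the example. Clustering is done as connected components over the matched pairs.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record):
    # Hypothetical blocking key: first three letters of the lowercased name.
    return record["name"].lower()[:3]


def jaccard(a, b):
    # Token-set Jaccard similarity between two strings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def link_records(records, threshold=0.5):
    # 1. Blocking: only records sharing a key are ever compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    # Union-find structure used to merge matched records into entities.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # 2. Pairwise matching: compare every pair inside each block.
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if jaccard(records[i]["name"], records[j]["name"]) >= threshold:
                parent[find(i)] = find(j)

    # 3. Clustering: connected components of the match graph.
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(sorted(c) for c in clusters.values())


records = [
    {"name": "John A. Smith"},
    {"name": "John Smith"},
    {"name": "Joan Baker"},
    {"name": "Mary Jones"},
]
entities = link_records(records)
print(entities)  # [[0, 1], [2], [3]]
```

Only the two "joh" records are compared at all; the other records sit alone in their blocks, which is where the scalability gain comes from.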
Volume: dealing with billions of records
- Map-reduce based record linkage
- Blocking
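A rough single-process sketch of how blocking maps onto the map-reduce pattern: the map phase emits a blocking key per record, the shuffle groups record ids by key, and each reducer generates candidate pairs for one block only. The function names, the zip-code-prefix key, and the sample records are hypothetical; a real job would run on a distributed framework rather than in a loop.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(record_id, record):
    # Emit (blocking key, record id); the key is a hypothetical zip-code prefix.
    yield record["zip"][:3], record_id


def reduce_phase(key, record_ids):
    # A reducer sees a single block and emits the candidate pairs inside it.
    for pair in combinations(sorted(record_ids), 2):
        yield pair


def run(records):
    # Shuffle: group mapper output by key (done by the framework in practice).
    shuffled = defaultdict(list)
    for rid, rec in records.items():
        for key, value in map_phase(rid, rec):
            shuffled[key].append(value)
    pairs = []
    for key, ids in shuffled.items():
        pairs.extend(reduce_phase(key, ids))
    return sorted(pairs)


records = {
    1: {"zip": "94110"},
    2: {"zip": "94117"},
    3: {"zip": "10001"},
    4: {"zip": "94103"},
}
candidate_pairs = run(records)
print(candidate_pairs)  # [(1, 2), (1, 4), (2, 4)]
```

Four records would give six pairs if compared exhaustively; blocking leaves three candidate pairs here.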
Velocity
- Incremental record linkage
Variety
- Matching structured and unstructured data
- Matching Web tables and catalogs
Veracity
- Linking temporal records
Data Fusion¶
Data fusion: voting + source quality + copy detection
- Resolves inconsistency across a diversity of sources
- Supports difference of opinion
- Gives more weight to knowledgeable sources
- Reduces the weight of copier sources
Rule-based¶
- Using the observed value from the most recently updated source
- Taking the average, maximum, or minimum for numerical values
- Majority voting
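The three rules listed above are simple enough to show directly. In this sketch the observation tuples `(source, value, last_updated)` and their values are made up for illustration, and dates are assumed to be ISO-formatted strings so that they sort chronologically.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations for a single data item: (source, value, last_updated).
observations = [
    ("s1", 10.0, "2020-01-05"),
    ("s2", 12.0, "2021-03-10"),
    ("s3", 10.0, "2019-07-22"),
]


def most_recent(obs):
    # Value from the most recently updated source (ISO dates sort as strings).
    return max(obs, key=lambda o: o[2])[1]


def average(obs):
    # Mean of the numerical values; max() or min() would be used the same way.
    return mean(o[1] for o in obs)


def majority_vote(obs):
    # Most frequently observed value.
    return Counter(o[1] for o in obs).most_common(1)[0][0]


print(most_recent(observations))    # 12.0
print(majority_vote(observations))  # 10.0
print(average(observations))
```

The rules can disagree on the same input, as they do here, which is one reason the choice of rule has to be made per attribute.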
Naive voting¶
- Supports difference of opinion and allows conflict resolution
- Works well for independent sources that have similar accuracy
- When sources have different accuracies: need to give more weight to votes by knowledgeable sources
- When sources copy from other sources: need to reduce the weight of votes by copiers
- Problem: the wisdom of the minority
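A small sketch of the "wisdom of the minority" problem: with equal votes the value repeated by copiers wins, while down-weighting copiers and up-weighting an accurate source flips the outcome. The claims and the hand-assigned weights below are assumptions for illustration; in practice the weights would have to be estimated rather than given.

```python
from collections import defaultdict


def weighted_vote(claims, weights=None):
    # claims: {source: value}. With weights=None every vote counts 1.0 (naive voting).
    totals = defaultdict(float)
    for source, value in claims.items():
        totals[value] += 1.0 if weights is None else weights.get(source, 1.0)
    return max(totals, key=totals.get)


# Hypothetical claims about one data item.
claims = {"s1": "A", "s2": "B", "s3": "B", "s4": "B"}

naive = weighted_vote(claims)

# Assumed for the example: s3 and s4 copy from s2, and s1 is the most accurate source.
weights = {"s1": 1.0, "s2": 0.6, "s3": 0.1, "s4": 0.1}
adjusted = weighted_vote(claims, weights)

print(naive, adjusted)  # B A
```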
Truth Discovery¶
An important feature of truth discovery is that it estimates source reliabilities. To identify the trustworthy information, i.e. the truths:
- Weighted aggregation of data based on the estimated source reliabilities
Both source reliabilities and truths are unknown.
- If a source provides trustworthy information frequently, it will be assigned a higher reliability.
- If a piece of information is supported by sources with high reliabilities, it will have a larger chance of being selected as the truth.
Iterate until convergence:
- Truth computation step
- Source weight estimation step
Truth computation:
- The truth is inferred through weighted voting.
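The two alternating steps can be sketched as follows. This is one simple instantiation under stated assumptions, not a specific published algorithm: claims are categorical, a source's weight is taken to be the fraction of its claims that agree with the current truths, and the sample data (two consistent sources `g1`, `g2` and two unreliable ones `b1`, `b2`) is invented for the example.

```python
from collections import defaultdict


def truth_discovery(claims, iterations=20, tol=1e-6):
    # claims: {source: {item: value}}
    sources = list(claims)
    weights = {s: 1.0 for s in sources}  # start from equal reliabilities
    truths = {}
    for _ in range(iterations):
        # Truth computation step: weighted voting per item.
        scores = defaultdict(lambda: defaultdict(float))
        for s in sources:
            for item, value in claims[s].items():
                scores[item][value] += weights[s]
        truths = {item: max(vals, key=vals.get) for item, vals in scores.items()}

        # Source weight estimation step: share of a source's claims that
        # agree with the current truths (an assumed, very simple estimator).
        new_weights = {
            s: sum(truths[i] == v for i, v in claims[s].items()) / len(claims[s])
            for s in sources
        }
        converged = max(abs(new_weights[s] - weights[s]) for s in sources) < tol
        weights = new_weights
        if converged:
            break
    return truths, weights


claims = {
    "g1": {"x1": "A", "x2": "A", "x3": "A", "x4": "A", "y": "A"},
    "g2": {"x1": "A", "x2": "A", "x3": "A", "x4": "A"},
    "b1": {"x1": "B", "x2": "B", "x3": "B", "x4": "B", "y": "B"},
    "b2": {"x1": "C", "x2": "C", "x3": "C", "x4": "C", "y": "B"},
}
truths, weights = truth_discovery(claims)
print(truths["y"], weights)
```

On item `y` the first round of equal-weight voting picks `B` (two votes against one). After the weights are re-estimated, `g1` outweighs `b1` and `b2` combined and the truth for `y` switches to `A`.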
Optimization-based Methods
Easier to understand and interpret
- Iterative methods
Prior knowledge
- Optimization-based: can be formulated as extra constraints
- Probabilistic graphical model: can be captured by the hyperparameters
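For orientation, one common way to write the optimization view is shown below. The notation is an assumption introduced here, not taken from the notes: $K$ sources, $N$ data items, $v_i^{k}$ the value source $k$ claims for item $i$, $v_i^{*}$ the estimated truth, $w_k$ the weight of source $k$, and $d(\cdot,\cdot)$ a loss measuring disagreement.

```latex
\min_{\{v_i^{*}\},\,\{w_k\}} \; \sum_{k=1}^{K} w_k \sum_{i=1}^{N} d\!\left(v_i^{*}, v_i^{k}\right)
\quad \text{s.t.} \quad \sum_{k=1}^{K} \exp(-w_k) = 1
```

Minimizing over the truths with the weights fixed, then over the weights with the truths fixed, gives the same two-step iteration as above. Prior knowledge would enter this formulation as additional constraints on $v_i^{*}$ or $w_k$.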
Techniques for big data¶
Veracity
- Using source trustworthiness
- Combining source accuracy and copy detection
- Multiple truth values
- Erroneous numeric data
- Experimental comparison on deep web data
Volume
- Online data fusion
Velocity
- Truth discovery for dynamic data
Variety
- Combining record linkage with data fusion