Big Data integration: Record linkage

Record linkage: blocking + pairwise matching + clustering - Scalability, similarity, semantics

Blocking: efficiently creates small blocks of similar records - Ensures scalability

Pairwise matching: compares all record pairs in a block - Computes similarity

Clustering: groups sets of records into entities - Ensures semantics
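The three stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the blocking key (first three letters of a hypothetical `name` field), the string similarity from `difflib`, the 0.8 threshold, and clustering by transitive closure of matched pairs are all choices made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    # Assumed blocking key: first three letters of the lowercased name.
    return record["name"].lower()[:3]


def build_blocks(records):
    # Blocking: group record indices by key so only nearby records are compared.
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[block_key(record)].append(idx)
    return blocks


def similarity(a, b):
    # Pairwise similarity in [0, 1] based on matching character runs.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def match_pairs(records, blocks, threshold=0.8):
    # Pairwise matching: compare every pair of records inside each block.
    matches = []
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if similarity(records[i], records[j]) >= threshold:
                matches.append((i, j))
    return matches


def cluster(n, matches):
    # Clustering: union-find over matched pairs (transitive closure).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in matches:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


def link_records(records, threshold=0.8):
    blocks = build_blocks(records)
    matches = match_pairs(records, blocks, threshold)
    return cluster(len(records), matches)


records = [
    {"name": "Jonathan Smith"},
    {"name": "Jonathon Smith"},
    {"name": "Mary Jones"},
    {"name": "Mary Jone"},
    {"name": "Alan Turing"},
]
print(link_records(records))
```

Each returned list holds the indices of records judged to refer to the same entity. Records that fall into different blocks are never compared, which is where the scalability comes from, at the cost of missing matches whose keys differ.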

Volume: dealing with billions of records

MapReduce-based record linkage
Blocking

Velocity

Incremental record linkage

Variety

Matching structured and unstructured data
Matching Web tables and catalogs

Veracity

Linking temporal records

Data Fusion

Data fusion: voting + source quality + copy detection
- Resolves inconsistency across a diversity of sources
- Supports difference of opinion
- Gives more weight to knowledgeable sources
- Reduces the weight of copier sources

Rule-based

Using the observed value from the most recently updated source
Taking the average, maximum, or minimum for numerical values
Majority voting
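
The three rules above are simple enough to express directly. The sketch below is illustrative only; the function names and the `(value, updated_at)` tuple layout are assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def latest_value(observations):
    # observations: list of (value, updated_at) tuples; updated_at must be comparable.
    # Returns the value reported by the most recently updated source.
    return max(observations, key=lambda obs: obs[1])[0]


def numeric_fusion(values, how="average"):
    # Resolve conflicting numeric values by average, maximum, or minimum.
    rules = {"average": mean, "maximum": max, "minimum": min}
    return rules[how](values)


def majority_vote(values):
    # Returns the most frequent value; on a tie, the value seen first wins.
    return Counter(values).most_common(1)[0][0]


print(latest_value([("Seattle", 2019), ("Boston", 2021), ("Austin", 2020)]))
print(numeric_fusion([10, 20, 30], how="average"))
print(majority_vote(["red", "blue", "red"]))
```

None of these rules looks at how reliable a source is, which is the limitation the following sections address.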

Naive voting

Supports difference of opinion and allows conflict resolution.

Works well for independent sources that have similar accuracy.
- When sources have different accuracies: need to give more weight to votes by knowledgeable sources
- When sources copy from other sources: need to reduce the weight of votes by copiers

Problem: the wisdom of the minority

Truth Discovery

An important feature of truth discovery is estimating source reliabilities.

To identify the trustworthy information, i.e. the truths:
- Weighted aggregation of data based on the estimated source reliabilities

Both source reliabilities and truths are unknown.
- If a source provides trustworthy information frequently, it will be assigned a higher reliability.
- If a piece of information is supported by sources with high reliabilities, it will have a larger chance of being selected as the truth.

Iterate until convergence:
- Truth computation step
- Source weight estimation step

Truth computation:
- The truth is inferred through weighted voting.
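
The alternation between the two steps can be sketched as follows. This is a deliberately simple heuristic, not a specific published algorithm: the weight update (smoothed fraction of a source's claims that agree with the current truths), the `claims` layout, and the `max_iters` cap are all assumptions made for illustration.

```python
from collections import defaultdict


def truth_discovery(claims, max_iters=20):
    """claims: dict mapping source -> {object: claimed value}."""
    weights = {source: 1.0 for source in claims}  # start with equal reliabilities
    truths = {}
    for _ in range(max_iters):
        # Truth computation step: weighted voting per object.
        scores = defaultdict(lambda: defaultdict(float))
        for source, source_claims in claims.items():
            for obj, value in source_claims.items():
                scores[obj][value] += weights[source]
        new_truths = {obj: max(votes, key=votes.get) for obj, votes in scores.items()}

        converged = new_truths == truths
        truths = new_truths

        # Source weight estimation step: smoothed agreement with current truths.
        weights = {}
        for source, source_claims in claims.items():
            agree = sum(truths[obj] == value for obj, value in source_claims.items())
            weights[source] = (agree + 1) / (len(source_claims) + 2)

        if converged:
            break
    return truths, weights


claims = {
    "A": {"x1": "a", "x2": "b", "x3": "c"},
    "B": {"x1": "a", "x2": "b", "x3": "z"},
    "C": {"x1": "a", "x2": "y", "x3": "c"},
}
truths, weights = truth_discovery(claims)
print(truths)
print(weights)
```

In this toy input, source A agrees with every inferred truth and ends up with the highest weight, so its votes count for more in later rounds. Real methods differ mainly in how the weight update and the voting rule are defined.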

Optimization-based Methods

Easier to understand and interpret:
- Iterative methods

Prior knowledge:
- Optimization-based: can be formulated as extra constraints
- Probabilistic graphical model: can be captured by the hyperparameters

Techniques for big data

Veracity

Using source trustworthiness
Combining source accuracy and copy detection
Multiple truth values
Erroneous numeric data
Experimental comparison on deep web data

Volume

Online data fusion

Velocity

Truth discovery for dynamic data

Variety

Combining record linkage with data fusion