Big Data integration: Record linkage

Record linkage: blocking + pairwise matching + clustering - Scalability, similarity, semantics

Blocking: efficiently create small blocks of similar records - Ensures scalability

Pairwise matching: compares all record pairs in a block - Computes similarity

Clustering: groups sets of records into entities - Ensures semantics
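
The three steps above can be sketched on toy data. This is a minimal illustration, not a reference implementation: the records, the first-letter blocking key, the use of `difflib.SequenceMatcher` as the similarity function, the 0.85 threshold, and connected components as the clustering rule are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: id -> "name, city"
records = {
    1: "John Smith, New York",
    2: "Jon Smith, New York",
    3: "Jane Doe, Boston",
    4: "Jane Do, Boston",
    5: "Alan Turing, London",
}

def blocking_key(text):
    # Assumed blocking key: first letter of the record (cheap to compute).
    return text[0].lower()

def similarity(a, b):
    # String similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(records, threshold=0.85):
    # 1. Blocking: only records sharing a key are ever compared.
    blocks = defaultdict(list)
    for rid, text in records.items():
        blocks[blocking_key(text)].append(rid)

    # 2. Pairwise matching: compare all record pairs inside each block.
    matches = []
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if similarity(records[a], records[b]) >= threshold:
                matches.append((a, b))

    # 3. Clustering: connected components of the match graph (union-find).
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())

print(link(records))
```

Blocking keeps the number of comparisons small: record 5 is never compared with records 1 to 4 because it falls in a different block.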

Volume: dealing with billions of records

  • MapReduce-based record linkage
  • Blocking

Velocity

  • Incremental record linkage

Variety

  • Matching structured and unstructured data
  • Matching Web tables and catalogs

Veracity

  • Linking temporal records

Data Fusion

Data fusion: voting + source quality + copy detection - Resolves inconsistency across a diversity of sources - Supports difference of opinion

Data fusion: voting + source quality + copy detection - Gives more weight to knowledgeable sources - Reduces weight of copier sources

Rule-based

  • Using the observed value from the most recently updated source
  • Taking the average, maximum, or minimum for numerical values
  • Majority voting
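
The rules listed above can be sketched as small functions over conflicting observations of one entity. The observation layout (`source`, `updated`, `price`, `city`) and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations of the same entity from three sources.
observations = [
    {"source": "A", "updated": "2023-01-10", "price": 10.0, "city": "Boston"},
    {"source": "B", "updated": "2023-03-05", "price": 12.0, "city": "Boston"},
    {"source": "C", "updated": "2023-02-20", "price": 11.0, "city": "Cambridge"},
]

def most_recent(obs, field):
    # Value from the most recently updated source.
    # ISO-8601 date strings sort chronologically as plain strings.
    return max(obs, key=lambda o: o["updated"])[field]

def numeric_rule(obs, field, rule=mean):
    # Average, maximum, or minimum: pass mean, max, or min as `rule`.
    return rule(o[field] for o in obs)

def majority_vote(obs, field):
    # Most frequently observed value; every source counts equally.
    return Counter(o[field] for o in obs).most_common(1)[0][0]

print(most_recent(observations, "price"))
print(numeric_rule(observations, "price"))
print(majority_vote(observations, "city"))
```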

Naive voting

  • Supports difference of opinion and allows conflict resolution
  • Works well for independent sources that have similar accuracy
  • When sources have different accuracies: votes by knowledgeable sources need more weight
  • When sources copy from other sources: votes by copiers need less weight
  • Problem: the wisdom of the minority (the true value may be held by a minority of sources)
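
A small sketch of why the weights matter. The sources, claimed values, and weights below are invented for illustration; in particular, the assumption that S3 and S4 copy from S2 and the specific discount of 0.1 are not derived from any copy-detection method.

```python
from collections import defaultdict

def weighted_vote(claims, weights):
    # Sum the weight of every source behind each value and
    # return the value with the largest total.
    scores = defaultdict(float)
    for source, value in claims.items():
        scores[value] += weights.get(source, 1.0)
    return max(scores, key=scores.get)

# Hypothetical claims about a single data item.
claims = {"S1": "Boston", "S2": "Cambridge", "S3": "Cambridge", "S4": "Cambridge"}

# Naive voting: every source has the same weight.
naive = {s: 1.0 for s in claims}

# Assumed prior knowledge: S1 is more accurate, S3 and S4 copy S2,
# so their votes are discounted (the numbers are illustrative only).
informed = {"S1": 1.5, "S2": 1.0, "S3": 0.1, "S4": 0.1}

print(weighted_vote(claims, naive))
print(weighted_vote(claims, informed))
```

With equal weights the copied value wins three votes to one. Once the copiers are discounted, the single knowledgeable source outweighs them.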

Truth Discovery

An important feature of truth discovery is estimating source reliabilities. To identify the trustworthy information, i.e. the truths, the data is aggregated with weights based on the estimated source reliabilities. Both source reliabilities and truths are unknown, so each is inferred from the other:

  • If a source provides trustworthy information frequently, it is assigned a higher reliability.
  • If a piece of information is supported by sources with high reliabilities, it has a larger chance of being selected as the truth.

Iterate until convergence:

  • Truth computation step: the truth is inferred through weighted voting.
  • Source weight estimation step
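
The two alternating steps can be sketched as follows. This is one possible instantiation under stated assumptions, not a specific published algorithm: the smoothed accuracy estimate, the weight formula `-log(1 - accuracy)`, the convergence test (truths stop changing), and the toy claims are all choices made for the example.

```python
import math
from collections import defaultdict

def truth_discovery(claims, max_iter=20):
    # claims: {source: {object: value}}
    weights = {s: 1.0 for s in claims}  # start from naive voting
    truths = {}
    for _ in range(max_iter):
        # Truth computation step: weighted voting per object.
        scores = defaultdict(lambda: defaultdict(float))
        for s, obs in claims.items():
            for obj, val in obs.items():
                scores[obj][val] += weights[s]
        new_truths = {o: max(v, key=v.get) for o, v in scores.items()}

        # Source weight estimation step (assumed formula): smoothed
        # accuracy against the current truths, mapped to a positive weight.
        for s, obs in claims.items():
            agree = sum(new_truths[o] == v for o, v in obs.items())
            accuracy = (agree + 1) / (len(obs) + 2)
            weights[s] = -math.log(1 - accuracy)

        if new_truths == truths:  # converged: truths stopped changing
            break
        truths = new_truths
    return truths, weights

# Hypothetical claims: A and B agree everywhere; C, D, E are each right
# on one of o1..o3 and jointly outvote A and B on o4.
claims = {
    "A": {"o1": "v1", "o2": "v1", "o3": "v1", "o4": "v1"},
    "B": {"o1": "v1", "o2": "v1", "o3": "v1", "o4": "v1"},
    "C": {"o1": "v1", "o2": "v2", "o3": "v2", "o4": "v2"},
    "D": {"o1": "v2", "o2": "v1", "o3": "v3", "o4": "v2"},
    "E": {"o1": "v3", "o2": "v3", "o3": "v1", "o4": "v2"},
}

truths, weights = truth_discovery(claims)
print(truths)
print(weights)
```

On this toy input, the first pass is plain majority voting and picks `v2` for `o4`. After the weights are re-estimated, A and B carry more weight than C, D, and E, and the next pass selects `v1` for `o4`.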

Optimization-based Methods

Easier to understand and interpret

  • Iterative methods

Prior knowledge

  • Optimization-based: can be formulated as extra constraints
  • Probabilistic graphical model: can be captured by the hyperparameters

Techniques for big data

Veracity

  • Using source trustworthiness
  • Combining source accuracy and copy detection
  • Multiple truth values
  • Erroneous numeric data
  • Experimental comparison on deep web data

Volume

  • Online data fusion

Velocity

  • Truth discovery for dynamic data

Variety

  • Combining record linkage with data fusion