Probabilistic query examples

Note: the following queries are test on MySQL.

Preprocessing queries:

The following queries generate probabilistic dirty relations similar to those in Figure. 1 in our paper (These queries also perform "identifier propagation" as described in Section 2.4.2 of the paper)

DROP TABLE IF EXISTS `cora`.`xprob_pubstr03`;
CREATE TABLE  `cora`.`xprob_pubstr03` (
  `tid` int not NULL,
#  `id`  varchar(100),
  `id`  int not NULL,
  `string` varchar(300),
  `prob` double,
  `vidFk` int,
  `cidFk` int
);

INSERT INTO  `cora`.`xprob_pubstr03`
SELECT t.tid, t.id, pt.string, t.prob, v.vid as vidFk, p.id as cidFk
FROM prob_pubtitles03 t, pub2venuef v, prob_pubvenues03 p, pubtitles pt
where v.tid = t.tid and p.tid = v.vid and pt.tid = t.tid
order by id, prob desc;


DROP TABLE IF EXISTS `cora`.`xprob_pubtitles03`;
CREATE TABLE  `cora`.`xprob_pubtitles03` (
  `tid` int not NULL,
#  `id`  varchar(100),
  `id`  int not NULL,
  `string` varchar(300),
  `prob` double,
  `vidFk` int,
  `cidFk` int
);

INSERT INTO  `cora`.`xprob_pubtitles03`
SELECT t.tid, t.id, pt.string, t.prob, v.vid as vidFk, p.id as cidFk
FROM prob_pubtitles03 t, pub2venuef v, prob_pubvenues03 p, pubtitles pt
where v.tid = t.tid and p.tid = v.vid and pt.tid = t.tid
order by id, prob desc;

DROP TABLE IF EXISTS `cora`.`xprob_pubvenues05`;
CREATE TABLE  `cora`.`xprob_pubvenues05` (
  `tid` int not NULL,
  `id`  int not null,
  `name` varchar(300) default NULL,
  `vol` varchar(100) default NULL,
  `date` varchar(100) default NULL,
  `prob` double
);

INSERT INTO  `cora`.`xprob_pubvenues05`
SELECT pr.tid, pr.id, v.name, v.vol, v.date, prob
FROM prob_pubvenues05 pr, pubvenues v
where pr.tid =v.tid;


Examples


Example 1)

Query over clean relation:

select tid
from   pubvenues
where date='1995'

is translated to the following query over the dirty relation =>

select id
from xprob_pubvenues
where `date` = '1995'
group by id

To return all the clean answers with the probabilities:

SELECT j.*, f.sprob FROM
(select id, sum(prob) as sprob
from xprob_pubvenues
where `date` = '1995'
group by id ) f
LEFT JOIN
(SELECT *
FROM xprob_pubvenues) j
ON f.id = j.id
order by sprob desc

Note: In practice, we replace 'date' in the queries with REPLACE(REPLACE(REPLACE(REPLACE(date,".",""),"(",""),")",""),",","") in MySQL in order to possibly retrieve more results (due to existence of brackets and symbols in the data attribute).

Example 2)

Query over clean relation:

select t.tid, v.tid
from xprob_pubvenues v, xprob_pubtitles t
where t.vidFk = v.tid
      and date='1995'

is translated to =>

select t.id, v.id, sum(v.prob*t.prob)
from xprob_pubvenues v, xprob_pubtitles t
where t.cidFk = v.id
      and date='1995'
group by v.id, t.id




Sample Results:

The clean query in example 1 will return:
TID id name vol date
7 4 In Proceedings of CHI95
(1995).
10 5 Presence 4(4) 1995.
28 9

(1995).
41 18 Vision Research 35 (1995)
42 18 Vision Research 35 (1995).
208 39 Perception 24 (1995)

The rewritten query will return:
tid id name vol date prob
7 4 In Proceedings of CHI95
(1995). 0.390027
9 4 In Proceedings of CHI95 242 249.
0.370025
8 4 in Proc. ACM Conference on Human Factors in Computing Systems (CHI95)
May 7-11, 1995, 0.239948
10 5 Presence 4(4) 1995. 1
28 9

(1995). 1
41 18 Vision Research 35 (1995) 0.504419
42 18 Vision Research 35 (1995). 0.495581
208 39 Perception 24 (1995) 1

tuples with tid=8 and tid=9 were missing from the clean query, and are sorted in a reasonable way.