As more and more information from autonomous databases becomes available to lay users, techniques for integrating and querying these databases must adapt to the imprecise nature of user queries as well as to incompleteness in the data due to missing attribute values (aka "null values"). In such scenarios, the query processor begins to acquire the role of a recommender system. Specifically, in addition to presenting answers that satisfy the user's query, the query processor is expected to provide highly relevant answers even when they do not exactly satisfy the query predicates.
This broadened view of query processing and the autonomous nature of web databases pose many new challenges:
- How to measure result relevance as a combined effect of query imprecision and data incompleteness?
- How to win the user's trust in the query results and in the ranks the mediator assigns to them?
- How to infer the relevance of an incomplete tuple with respect to a user query given the restricted data access privileges imposed on web data sources and the limited support for query patterns?
- How to achieve efficiency given the bounded pool of database and network resources in the web environment?
To tackle these challenges, we have developed a suite of techniques as outlined below.
- We introduce a novel query rewriting and optimization framework that retrieves relevant possible tuples with missing values on the query predicates without modifying the web databases. Our technique involves reformulating the user query based on mined correlations among the database attributes.
- We develop techniques to gauge the relevance of the rewritten queries, allowing relevance to be traded off against the costs of database query processing and answer transmission. This involves our proposed methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naïve Bayes Classifiers), and selectivity estimates.
- We propose a decision-theoretic model for ranking answers in the order of their expected relevance to the user. This model combines a relevance function, which reflects the relevance a user would associate with answer tuples, and a density function, which reflects each tuple's distribution of missing data.
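The rewriting idea in the list above can be illustrated with a minimal Python sketch. It is not the project's implementation. It assumes a hypothetical toy `CARS` relation, a hypothetical single-attribute selection interface `select`, and one assumed dependency in which `model` approximately determines `body_style`. For brevity it estimates value distributions by simple counting over the table itself rather than with a Naïve Bayes classifier over a probed sample, and it orders rewritten queries by estimated precision only, ignoring selectivity.

```python
from collections import Counter, defaultdict

# Hypothetical toy relation; None marks a missing (null) value.
CARS = [
    {"id": 1, "make": "BMW", "model": "Z4", "body_style": "Convertible"},
    {"id": 2, "make": "BMW", "model": "Z4", "body_style": None},
    {"id": 3, "make": "Honda", "model": "Civic", "body_style": "Sedan"},
    {"id": 4, "make": "Honda", "model": "Civic", "body_style": None},
    {"id": 5, "make": "Audi", "model": "A4", "body_style": "Convertible"},
    {"id": 6, "make": "Audi", "model": "A4", "body_style": "Sedan"},
    {"id": 7, "make": "Audi", "model": "A4", "body_style": None},
    {"id": 8, "make": "BMW", "model": "Z4", "body_style": "Convertible"},
]


def select(rows, attr, value):
    """Stand-in for a source's query interface: equality on one attribute."""
    return [r for r in rows if r[attr] == value]


def learn_value_distribution(sample, determining, target):
    """Estimate P(target = v | determining = d) by counting complete tuples."""
    counts = defaultdict(Counter)
    for r in sample:
        if r[determining] is not None and r[target] is not None:
            counts[r[determining]][r[target]] += 1
    return {
        d: {v: n / sum(c.values()) for v, n in c.items()}
        for d, c in counts.items()
    }


def answer_query(rows, target, value, determining):
    """Return (certain answers, ranked possible answers) for target = value."""
    # 1. The original query retrieves the certain answers.
    certain = select(rows, target, value)
    # 2. Learn how the determining attribute predicts the target attribute.
    dist = learn_value_distribution(rows, determining, target)
    # 3. One rewritten query per determining value seen in the certain answers,
    #    ordered by the estimated probability that a null hides `value`.
    candidates = {r[determining] for r in certain}
    ranked = sorted(
        candidates,
        key=lambda d: dist.get(d, {}).get(value, 0.0),
        reverse=True,
    )
    # 4. Issue the rewritten queries and keep only tuples whose target is null.
    possible = []
    for d in ranked:
        p = dist.get(d, {}).get(value, 0.0)
        for r in select(rows, determining, d):
            if r[target] is None:
                possible.append((p, r))
    return certain, possible


certain, possible = answer_query(CARS, "body_style", "Convertible", "model")
print([r["id"] for r in certain])            # [1, 5, 8]
print([(p, r["id"]) for p, r in possible])   # [(1.0, 2), (0.5, 7)]
```

In this toy run, tuple 2 (a Z4 with a missing body style) ranks above tuple 7 (an A4) because every Z4 with a known body style is a convertible, while only half of the A4s are. Tuple 4 is never retrieved, since no certain answer has the model Civic. The source is accessed only through `select`, which reflects the constraint that the web databases themselves are not modified.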
Publications:
- Garrett Wolf, Aravind Kalavagattu, Hemal Khatri, Raju Balakrishnan, Bhaumik Chokshi, Jianchun Fan, Yi Chen and Subbarao Kambhampati. "Query Processing over Incomplete Autonomous Databases: Query Rewriting Using Learned Data Dependencies". International Journal on Very Large Data Bases (VLDBJ), Special Issue on Uncertain and Probabilistic Databases, 18(5):1167-1190, 2009.
- Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, Yi Chen and Subbarao Kambhampati. "Query Processing over Incomplete Autonomous Databases". In Proceedings of 33rd International Conference on Very Large Data Bases (VLDB), 2007.
- Subbarao Kambhampati, Garrett Wolf, Yi Chen, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, and Ullas Nambiar. "QUIC: A System for Handling Imprecision & Incompleteness in Autonomous Databases". In Proceedings of 3rd Biennial Conference on Innovative Data Systems Research (CIDR), 2007.
- Hemal Khatri, Jianchun Fan, Yi Chen, and Subbarao Kambhampati. "QPIAD: Query Processing over Incomplete Autonomous Databases". In Proceedings of 22nd International Conference on Data Engineering (ICDE), 2007.
Faculty:
Students:
Raju Balakrishnan
Bhaumik Chokshi
Jianchun Fan
Aravind Kalavagattu
Hemal Khatri
Garrett Wolf

