ACM SIGMOD/PODS Conference: Vancouver, 2008
Program: SIGMOD Undergraduate Posters
The following undergraduate students had posters accepted for SIGMOD 2008:
NN-Queries over Dynamic Graph Data
Alexander G. Connor
Deptartment of Computer Science, University of Pittsburgh, USA
A major drawback of most GIS implementations is that they do not support efficient nearest-neighbor(NN) queries on road networks. The traditional method is to represent objects on a road network as vertices on a graph; NN queries are answered by a top-k search. This search takes linear time and cannot be optimized by pre-computing and storing each vertex's nearest neighbors when a vertex's weight can change dynamically. We propose an algorithm for finding nearest neighbors in such edge-weighted vertex-weighted graphs like those used to represent road networks. Our proposed algorithm runs in logarithmic expected time in the number of vertices. Along with the NN search algorithm we propose an algorithm that will allow for the weight of a vertex in the graph to be updated in logarithmic time in the number of vertices. We provide theoretical as well as experimental results that uphold the correctness and expected runtime of each algorithm. In order to illustrate the problem's applicability, we simulate an emergency response system in a city where NN queries are used to determine nearest and most available ambulances and hospitals.
Frequent Pattern Mining on Graphics Processors
Department of Computer Science and Technology, Peking University, China
We study frequent pattern mining on Graphics Processors (GPUs). GPUs are traditionally designed for gaming applications and have recently been used for general-purpose computing (GPGPU). Compared with current multi-core CPUs, GPUs have an order of magnitude higher computational power as well as memory bandwidth. Nevertheless, the GPU architecture and programming model is largely optimized for graphics processing, where data items are processed uniformly and control flows consist of a few rather fixed stages. Therefore, it is a challenging task to map CPU-based data mining algorithms onto the GPU.
As a first step, we study the a-priori mining algorithm on the GPU. Our GPU-based a-priori adopts the vertical data layout, where the database consists of lists of transaction IDs associated with each item. We count the support (frequency) of each item efficiently by intersecting TID-lists in parallel. We adopt partition pruning and TID-list compression to further improve the efficiency. We evaluated our algorithms on a PC with an NVIDIA G80 GPU and an Intel Core2Duo CPU. Our GPU-based algorithm is up to 10 times faster than its CPU-based counterpart.
This is joint work with Bingsheng He, Yifan He, Qiong Luo, Bo Peng, and Xiangye Xiao.
CC-Optimizer: A Cache Conscious Query Optimizer with Buffering Operator
Department of Information and Computer Scicence, Keio University, Japan
The access cost gap between CPU and memory incurs performance degradations on RDBMS. One of the reasons why instruction cache misses occur is too large footprint size of RDBMS operators which does not fit in a L1 instruction cache. To solve this problem, Zhou et al proposed the buffering operator (BO). Although BO improves performance in some cases, it also degrades performance in some cases. However, they did not clearly analyze the effective and ineffective cases. On this background, we investigated BO behavior experimentally by reimplementing it, and realized a cache conscious query optimizer called CC-Optimizer (CCO) based on the result of the investigation. CCO applies BO only when it is appropriate. Our contributions are the design of the optimizer and its implementation on a RDBMS. To evaluate CCO, we compare the performance of normal PostgreSQL (7.3.16) with PostgreSQL using CCO on Linux Kernel 2.6.15. OSDL DBT-3 test suite is used for the performance measurement. The result of experiments show that the performance improvement achieved by CCO. It is 73.7% in the maximum case (Q4), and 17.5% in the average of all queries.
A Flexible Approach for Extracting Metadata from Bibliographic Citations
Eli Cortez C. Vilarinho
Computer Science Department, Federal University of Amazonas, Brazil
In this work we address the problem of extracting components of bibliographic citations found on scientific articles. We propose a method called FLUX-CiM (Flexible Unsupervised Extraction of Citation Metadata). Differently from previous approaches based on models learned from user-driven training, our approach relies on a knowledge-base automatically constructed from an existing set of sample metadata records. Our technique shows a high degree of automation and flexibility, as demonstrated by experiments we have performed.
We experimented our method comparing it with Conditional Random Fields (CRF), the state-of-art in the literature for this problem. The extraction quality obtained by FLUX-CiM, even without user intervention, reached F-Measure levels above 92% (almost 3% higher in average than CRF) on three distinct citation datasets, including CORA, the one used in the original CRF paper. Another experiment we carried out showed that our method deals with several distinct citation styles without compromising the extraction quality, a feature not present in CRF whose extraction quality degrades with the number of distinct citation styles used. The results obtained in these experiments in comparison with the state-of-art research lead us to regard our method as the best cost effective method for metadata citation extraction in the literature.
Privacy Preservation for Multiple Sensitive Attributes
Department of Computer Science, Tsinghua University, China
Aiming at ensuring privacy preservation in personal data publication, the topic of anonymization has been intensively studied in recent years. Anonymization techniques typically eliminate identifying attributes and perform generalization on QI attributes to partition the table into QI-groups, each composed of tuples with identical generalized QI values. Anonymization principles such as k-anonymity, l-diversity put constraints on QI-groups. However,they all simply assume each tuple in the table contains one single sensitive attribute (the SSA case), none paying attention to the case of multiple sensitive attributes in a tuple (the MSA case). In this poster, we conduct the pioneering study on MSA, observe new privacy risk, and reason why anonymization is impractical for this case. Instead, we propose the approach of sensitive attributes transformation to tackle such issue. Sensitive attribute transformation includes three types of operations: generalization, suppression and noising. We also investigate other advantages of our approach, including personal customized privacy preservation, privacy preservation for dynamic database with inserting, deleting and updating. We also observe our approach can be degenerated to the SSA case. Extensive experiments with real dataset verify our technique, substantiating its effects in MSA case and advantages over the state of the art in SSA case.