A distributed data management system to support large-scale data analysis

Keywords:
Source:
Journal of Systems and Software
Full-text link 1:
file1/M00/06/5D/Csgk0FxRaieAET0eAC0fceZAUjQ936.pdf
Full-text link 2:
https://www.sciencedirect.com/science/article/pii/S0164121218302437
Type:
Academic literature
Language:
English
Original publication date:
2018-11-08
Abstract:
Distributed data management is a key technology for efficient processing and analysis of massive data in cluster-computing environments. In particular, when data volumes exceed system capabilities, big data files must be summarized by representative samples that have the same statistical properties as the whole dataset. This paper proposes a big data management system (BDMS) based on distributed random sample data blocks. It presents a high-level architecture design of the BDMS, which extends current distributed file systems. The system provides block-level management functions such as statistically aware data partitioning, data block organization, and data block selection. The paper also presents a round-random partitioning scheme that represents a big dataset as a set of non-overlapping data blocks, each of which is a random sample of the whole dataset. Based on this scheme, two algorithms are introduced as an implementation strategy for converting the HDFS blocks of a big file into a set of random sample data blocks, which are also stored in HDFS. The experimental results show that the execution time of the partitioning operation is acceptable in real applications because the operation is performed only once on each input data file.
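
The abstract does not spell out the two algorithms, so the following is only an illustrative sketch of how a partitioning of this kind could work, not the paper's method. It assumes each original block is shuffled locally and its records are then dealt round-robin across the output blocks, so every output block receives a share of every input block and no record appears twice. The function name `round_random_partition`, its parameters, and the in-memory lists standing in for HDFS blocks are all hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def round_random_partition(
    blocks: Sequence[Sequence[T]], num_out: int, seed: int = 0
) -> List[List[T]]:
    """Redistribute records from the input blocks into num_out output blocks.

    Assumption (not taken from the paper): shuffle each input block, then
    deal its records round-robin over the output blocks.
    """
    if num_out <= 0:
        raise ValueError("num_out must be positive")
    rng = random.Random(seed)
    out: List[List[T]] = [[] for _ in range(num_out)]
    for block in blocks:
        shuffled = list(block)      # copy so the caller's data is untouched
        rng.shuffle(shuffled)       # local randomization of one input block
        offset = rng.randrange(num_out)  # vary which output gets the first record
        for i, record in enumerate(shuffled):
            out[(offset + i) % num_out].append(record)
    return out


# Four stand-in "blocks" of 100 sequential records each.
input_blocks = [list(range(i * 100, (i + 1) * 100)) for i in range(4)]
sample_blocks = round_random_partition(input_blocks, num_out=5)
print([len(b) for b in sample_blocks])
```

Sequentially written input blocks are typically not random samples on their own (here, each holds a contiguous range of values), whereas each output block in this sketch draws records from all of them. Because every record is assigned to exactly one output block, the outputs are non-overlapping and together reproduce the full dataset.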