博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
MapReduce: Simplified Data Processing on Large ...
阅读量:6940 次
发布时间:2019-06-27

本文共 3666 字,大约阅读时间需要 12 分钟。

hot3.png

MapReduce: Simplified Data Processing on Large Clusters

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules intermachine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day. 

1 Introduction

Prior to our development of MapReduce, the authors and many others at Google implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, Web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of Web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record’ in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use reexecution as the primary mechanism for fault tolerance. 

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. The programming model can also be used to parallelize computations across multiple cores of the same machine.

Section 2 describes the basic programming model and gives several examples. In Section 3, we describe an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. In Section 6, we explore the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work. 

转载于:https://my.oschina.net/zetaplusae/blog/116929

你可能感兴趣的文章
hive 12及以后,可以使用非同步查询
查看>>
CentOS 7安装部署ELK 6.2.4
查看>>
通过ansible批量安装部署mysql
查看>>
Kafka/Metaq设计思想学习笔记
查看>>
Voice Lab 8-SIP笔记
查看>>
如何在 block 中修改外部变量
查看>>
fedora update to 23
查看>>
linux的运行级别及相应含义
查看>>
阿里的分布式持续集成系统-reliable
查看>>
【转】单日峰值2T发送量邮件营销平台实践经验
查看>>
Dell Compellent的一些缺陷
查看>>
我的友情链接
查看>>
分布式消息系统 Kafka 简介
查看>>
内部控制
查看>>
iOS地图选址
查看>>
我的友情链接
查看>>
自动监控linux服务器负载并重启Web服务的脚本
查看>>
四、Windows Server 2012R2 Hyper-v虚拟交换机的创建与管理
查看>>
java 运算顺序
查看>>
天涯LVS部署
查看>>