Description

Read the given paper and write a report about it that covers:
1. The topic of the paper: what is discussed/stated in the paper (mainly shown in the abstract section).
2. A summary of the paper:
   2.1 What kind of problem does the paper solve?
   2.2 What is the solution proposed in the paper? How does it solve the problem?
   2.3 What are the advantages of the proposed solution compared with other solutions?
   2.4 What possible improvements or future work could build on the solution proposed in the paper?
3. Please follow the report template that is provided below.

Write a report on the paper (no more than two pages, single column).

Write a report of about 1,000 words.

Fonts and Spacing

Title: Calibri Light

Content: Times New Roman, 12 points, single line spacing; no indentation between paragraphs

Topic of this paper
What is discussed/stated in this paper (mainly shown in the abstract section of the paper)

Summary of the paper: What kind of problem does this paper solve?
What is the solution proposed in this paper? How does it solve the problem?
What are the advantages of the solution proposed in this paper compared with other solutions?
What possible improvements or future work could be achieved based on the solution proposed in this paper?

For your convenience, a report template is provided below:

Report Template:

Title

                                             Name:

Section 1 Title

Paragraph 1 of Section 1. Write your content here. When you finish this paragraph, just press the Enter key on your keyboard to start a new line as the first line of the next paragraph.

Paragraph 2 of Section 1. Continue your content in the new paragraph.

Section 2 Title

Paragraph 1 of Section 2.

Paragraph 2 of Section 2.

Spark: Cluster Computing with Working Sets

Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley

Abstract

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

1 Introduction

A new model of cluster computing has become widely popular, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing. MapReduce [11] pioneered this model, while systems like Dryad [17] and Map-Reduce-Merge [24] generalized the types of data flows supported. These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention.

While this data flow programming model is useful for a large class of applications, there are applications that cannot be expressed efficiently as acyclic data flows. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes two use cases where we have seen Hadoop users report that MapReduce is deficient:

• Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.

• Interactive analytics: Hadoop is often used to run ad-hoc exploratory queries on large datasets, through SQL interfaces such as Pig [21] and Hive [1]. Ideally, a user would be able to load a dataset of interest into memory across a number of machines and query it repeatedly. However, with Hadoop, each query incurs significant latency (tens of seconds) because it runs as a separate MapReduce job and reads data from disk.

This paper presents a new cluster computing framework called Spark, which supports applications with working sets while providing similar scalability and fault tolerance properties to MapReduce.

The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. Although RDDs are not a general shared memory abstraction, they represent a sweet spot between expressivity on the one hand and scalability and reliability on the other hand, and we have found them well-suited for a variety of applications.

Spark is implemented in Scala [5], a statically typed high-level programming language for the Java VM, and exposes a functional programming interface similar to DryadLINQ [25]. In addition, Spark can be used interactively from a modified version of the Scala interpreter, which allows the user to define RDDs, functions, variables and classes and use them in parallel operations on a cluster. We believe that Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.

Although our implementation of Spark is still a prototype, early experience with the system is encouraging. We show that Spark can outperform Hadoop by 10x in iterative machine learning workloads and can be used interactively to scan a 39 GB dataset with sub-second latency.

This paper is organized as follows. Section 2 describes Spark's programming model and RDDs. Section 3 shows some example jobs. Section 4 describes our implementation, including our integration into Scala and its interpreter. Section 5 presents early results. We survey related work in Section 6 and end with a discussion in Section 7.

2 Programming Model

To use Spark, developers write a driver program that implements the high-level control flow of their application and launches various operations in parallel. Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets (invoked by passing a function to apply on a dataset). In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster, which we shall explain later.

2.1 Resilient Distributed Datasets (RDDs)

A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. The elements of an RDD need not exist in physical storage; instead, a handle to an RDD contains enough information to compute the RDD starting from data in reliable storage. This means that RDDs can always be reconstructed if nodes fail.

In Spark, each RDD is represented by a Scala object. Spark lets programmers construct RDDs in four ways:

• From a file in a shared file system, such as the Hadoop Distributed File System (HDFS).

• By "parallelizing" a Scala collection (e.g., an array) in the driver program, which means dividing it into a number of slices that will be sent to multiple nodes.

• By transforming an existing RDD. A dataset with elements of type A can be transformed into a dataset with elements of type B using an operation called flatMap, which passes each element through a user-provided function of type A => List[B]. (flatMap has the same semantics as the map in MapReduce, but map is usually used to refer to a one-to-one function of type A => B in Scala.) Other transformations can be expressed using flatMap, including map (pass elements through a function of type A => B) and filter (pick elements matching a predicate).

• By changing the persistence of an existing RDD. By default, RDDs are lazy and ephemeral. That is, partitions of a dataset are materialized on demand when they are used in a parallel operation (e.g., by passing a block of a file through a map function), and are discarded from memory after use. (This is how "distributed collections" function in DryadLINQ.) However, a user can alter the persistence of an RDD through two actions:

  – The cache action leaves the dataset lazy, but hints that it should be kept in memory after the first time it is computed, because it will be reused.

  – The save action evaluates the dataset and writes it to a distributed filesystem such as HDFS. The saved version is used in future operations on it.

We note that our cache action is only a hint: if there is not enough memory in the cluster to cache all partitions of a dataset, Spark will recompute them when they are used. We chose this design so that Spark programs keep working (at reduced performance) if nodes fail or if a dataset is too big. This idea is loosely analogous to virtual memory.

We also plan to extend Spark to support other levels of persistence (e.g., in-memory replication across multiple nodes). Our goal is to let users trade off between the cost of storing an RDD, the speed of accessing it, the probability of losing part of it, and the cost of recomputing it.
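[Editor's note] As a concrete illustration of the four ways to construct an RDD described in Section 2.1, the following minimal Scala sketch strings together the operations the paper names (textFile, parallelize, flatMap, filter, cache, save). It assumes a driver-side entry point object named spark, as in the paper's own examples; the collection contents and the commented-out save call's argument are illustrative assumptions, not the paper's verbatim API.

  // Hedged sketch: the four RDD constructors of Section 2.1, using the paper's API style.
  val file      = spark.textFile("hdfs://...")                       // 1. from a file in HDFS
  val nums      = spark.parallelize(1 to 1000000)                    // 2. by parallelizing a Scala collection
  val words     = file.flatMap(line => line.split(" ").toList)       // 3. by transforming an existing RDD (A => List[B])
  val longWords = words.filter(w => w.length > 10)
  val cached    = longWords.cache()                                  // 4a. change persistence: keep in memory after first use
  // cached.save("hdfs://...")                                       // 4b. assumed signature: evaluate and write to HDFS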
2.2 Parallel Operations

Several parallel operations can be performed on RDDs:

• reduce: Combines dataset elements using an associative function to produce a result at the driver program.

• collect: Sends all elements of the dataset to the driver program. For example, an easy way to update an array in parallel is to parallelize, map and collect the array.

• foreach: Passes each element through a user-provided function. This is only done for the side effects of the function (which might be to copy data to another system or to update a shared variable as explained below).

We note that Spark does not currently support a grouped reduce operation as in MapReduce; reduce results are only collected at one process (the driver), although local reductions are first performed at each node. We plan to support grouped reductions in the future using a "shuffle" transformation on distributed datasets, as described in Section 7. However, even using a single reducer is enough to express a variety of useful algorithms. For example, a recent paper on MapReduce for machine learning on multicore systems [10] implemented ten learning algorithms without supporting parallel reduction.
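[Editor's note] To make the three operations above concrete, here is a minimal sketch in the style of the paper's examples, again assuming the spark driver object; the numbers and the println side effect are illustrative assumptions.

  // Hedged sketch of Spark's parallel operations from Section 2.2.
  val nums    = spark.parallelize(1 to 100)        // distribute a local collection
  val squares = nums.map(x => x * x)
  val total   = squares.reduce((a, b) => a + b)    // reduce: associative combine, result at the driver
  val all     = squares.collect()                  // collect: ship all elements back to the driver
  squares.foreach(x => println(x))                 // foreach: run a function only for its side effects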
They can be used to implementcounters as in MapReduce and to provide amore imperative syntax for parallel sums. Accumulatorscan be defined for any type that has an “add”operation and a “zero” value. Due to their “add-only”semantics, they are easy to make fault-tolerant.3 ExamplesWe now show some sample Spark programs. Note that weomit variable types because Scala supports type inference.3.1 Text SearchSuppose that we wish to count the lines containing errorsin a large log file stored in HDFS. This can be implementedby starting with a file dataset object as follows:val file = spark.textFile(“hdfs://…”)val errs = file.filter(.contains(“ERROR”)) val ones = errs.map( => 1)val count = ones.reduce(+)We first create a distributed dataset called file thatrepresents the HDFS file as a collection of lines. We transformthis dataset to create the set of lines containing “ERROR”(errs), and then map each line to a 1 and add upthese ones using reduce. The arguments to filter, map andreduce are Scala syntax for function literals.Note that errs and ones are lazy RDDs that are nevermaterialized. Instead, when reduce is called, each workernode scans input blocks in a streaming manner to evaluateones, adds these to perform a local reduce, and sends itslocal count to the driver. When used with lazy datasets inthis manner, Spark closely emulates MapReduce.Where Spark differs from other frameworks is that itcan make some of the intermediate datasets persist acrossoperations. For example, if wanted to reuse the errsdataset, we could create a cached RDD from it as follows:val cachedErrs = errs.cache()We would now be able to invoke parallel operations oncachedErrs or on datasets derived from it as usual, butnodes would cache partitions of cachedErrs in memoryafter the first time they compute them, greatly speedingup subsequent operations on it.3.2 Logistic RegressionThe following program implements logistic regression[3], an iterative classification algorithm that attempts tofind a hyperplane w that best separates two sets of points.The algorithm performs gradient descent: it starts w at arandom value, and on each iteration, it sums a function ofw over the data to move w in a direction that improves it.It thus benefits greatly from caching the data in memoryacross iterations. We do not explain logistic regression indetail, but we use it to show a few new Spark features.// Read points from a text file and cache themval points = spark.textFile(…).map(parsePoint).cache()// Initialize w to random D-dimensional vectorvar w = Vector.random(D)// Run multiple iterations to update wfor (i <- 1 to ITERATIONS) { val grad = spark.accumulator(new Vector(D)) for (p <- points) { // Runs in parallel val s = (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y grad += s * p.x } w -= grad.value } First, although we create an RDD called points, we process it by running a for loop over it. The for keyword in Scala is syntactic sugar for invoking the foreach method of a collection with the loop body as a closure. That is, the code for(p <- points){body} is equivalent to points.foreach(p => {body}). Therefore,we are invoking Spark’s parallel foreach operation.Second, to sum up the gradient, we use an accumulatorvariable called gradient (with a value of type V ector).Note that the loop adds to gradient using an overloaded+= operator. The combination of accumulators and forsyntax allows Spark programs to look much like imperativeserial programs. 
3 Examples

We now show some sample Spark programs. Note that we omit variable types because Scala supports type inference.

3.1 Text Search

Suppose that we wish to count the lines containing errors in a large log file stored in HDFS. This can be implemented by starting with a file dataset object as follows:

  val file = spark.textFile("hdfs://...")
  val errs = file.filter(_.contains("ERROR"))
  val ones = errs.map(_ => 1)
  val count = ones.reduce(_ + _)

We first create a distributed dataset called file that represents the HDFS file as a collection of lines. We transform this dataset to create the set of lines containing "ERROR" (errs), and then map each line to a 1 and add up these ones using reduce. The arguments to filter, map and reduce are Scala syntax for function literals.

Note that errs and ones are lazy RDDs that are never materialized. Instead, when reduce is called, each worker node scans input blocks in a streaming manner to evaluate ones, adds these to perform a local reduce, and sends its local count to the driver. When used with lazy datasets in this manner, Spark closely emulates MapReduce.

Where Spark differs from other frameworks is that it can make some of the intermediate datasets persist across operations. For example, if we wanted to reuse the errs dataset, we could create a cached RDD from it as follows:

  val cachedErrs = errs.cache()

We would now be able to invoke parallel operations on cachedErrs or on datasets derived from it as usual, but nodes would cache partitions of cachedErrs in memory after the first time they compute them, greatly speeding up subsequent operations on it.

3.2 Logistic Regression

The following program implements logistic regression [3], an iterative classification algorithm that attempts to find a hyperplane w that best separates two sets of points. The algorithm performs gradient descent: it starts w at a random value, and on each iteration, it sums a function of w over the data to move w in a direction that improves it. It thus benefits greatly from caching the data in memory across iterations. We do not explain logistic regression in detail, but we use it to show a few new Spark features.

  // Read points from a text file and cache them
  val points = spark.textFile(...).map(parsePoint).cache()
  // Initialize w to a random D-dimensional vector
  var w = Vector.random(D)
  // Run multiple iterations to update w
  for (i <- 1 to ITERATIONS) {
    val grad = spark.accumulator(new Vector(D))
    for (p <- points) { // Runs in parallel
      val s = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
      grad += s * p.x
    }
    w -= grad.value
  }

First, although we create an RDD called points, we process it by running a for loop over it. The for keyword in Scala is syntactic sugar for invoking the foreach method of a collection with the loop body as a closure. That is, the code for (p <- points) { body } is equivalent to points.foreach(p => { body }). Therefore, we are invoking Spark's parallel foreach operation.

Second, to sum up the gradient, we use an accumulator variable called grad (with a value of type Vector). Note that the loop adds to grad using an overloaded += operator. The combination of accumulators and for syntax allows Spark programs to look much like imperative serial programs. Indeed, this example differs from a serial version of logistic regression in only three lines.

3.3 Alternating Least Squares

Our final example is an algorithm called alternating least squares (ALS). ALS is used for collaborative filtering problems, such as predicting users' ratings for movies that they have not seen based on their movie rating history (as in the Netflix Challenge). Unlike our previous examples, ALS is CPU-intensive rather than data-intensive.

We briefly sketch ALS and refer the reader to [27] for details. Suppose that we wanted to predict the ratings of u users for m movies, and that we had a partially filled matrix R containing the known ratings for some user-movie pairs. ALS models R as the product of two matrices M and U of dimensions m × k and k × u respectively; that is, each user and each movie has a k-dimensional "feature vector" describing its characteristics, and a user's rating for a movie is the dot product of its feature vector and the movie's. ALS solves for M and U using the known ratings and then computes M × U to predict the unknown ones. This is done using the following iterative process:

1. Initialize M to a random value.
2. Optimize U given M to minimize error on R.
3. Optimize M given U to minimize error on R.
4. Repeat steps 2 and 3 until convergence.

ALS can be parallelized by updating different users / movies on each node in steps 2 and 3. However, because all of the steps use R, it is helpful to make R a broadcast variable so that it does not get re-sent to each node on each step. A Spark implementation of ALS that does so is shown below. Note that we parallelize the collection 0 until u (a Scala range object) and collect it to update each array:

  val Rb = spark.broadcast(R)
  for (i <- 1 to ITERATIONS) {
    U = spark.parallelize(0 until u)
             .map(j => updateUser(j, Rb, M))
             .collect()
    M = spark.parallelize(0 until m)
             .map(j => updateMovie(j, Rb, U))
             .collect()
  }

4 Implementation

Spark is built on top of Mesos [16, 15], a "cluster operating system" that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster. This allows Spark to run alongside existing cluster computing frameworks, such as Mesos ports of Hadoop and MPI, and share data with them. In addition, building on Mesos greatly reduced the programming effort that had to go into Spark.

The core of Spark is the implementation of resilient distributed datasets. As an example, suppose that we define a cached dataset called cachedErrs representing error messages in a log file, and that we count its elements using map and reduce, as in Section 3.1:

  val file = spark.textFile("hdfs://...")
  val errs = file.filter(_.contains("ERROR"))
  val cachedErrs = errs.cache()
  val ones = cachedErrs.map(_ => 1)
  val count = ones.reduce(_ + _)

These datasets will be stored as a chain of objects capturing the lineage of each RDD, shown in Figure 1. Each dataset object contains a pointer to its parent and information about how the parent was transformed.

[Figure 1: Lineage chain for the distributed dataset objects defined in the example in Section 4: file (HdfsTextFile, path = hdfs://...) -> errs (FilteredDataset, func = _.contains(...)) -> cachedErrs (CachedDataset) -> ones (MappedDataset, func = _ => 1).]

Internally, each RDD object implements the same simple interface, which consists of three operations:

• getPartitions, which returns a list of partition IDs.

• getIterator(partition), which iterates over a partition.

• getPreferredLocations(partition), which is used for task scheduling to achieve data locality.

When a parallel operation is invoked on a dataset, Spark creates a task to process each partition of the dataset and sends these tasks to worker nodes. We try to send each task to one of its preferred locations using a technique called delay scheduling [26]. Once launched on a worker, each task calls getIterator to start reading its partition.

The different types of RDDs differ only in how they implement the RDD interface. For example, for an HdfsTextFile, the partitions are block IDs in HDFS, their preferred locations are the block locations, and getIterator opens a stream to read a block. In a MappedDataset, the partitions and preferred locations are the same as for the parent, but the iterator applies the map function to elements of the parent. Finally, in a CachedDataset, the getIterator method looks for a locally cached copy of a transformed partition, and each partition's preferred locations start out equal to the parent's preferred locations, but get updated after the partition is cached on some node to prefer reusing that node. This design makes faults easy to handle: if a node fails, its partitions are re-read from their parent datasets and eventually cached on other nodes.
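[Editor's note] The paper names the three operations of the internal RDD interface but does not give their Scala signatures. The sketch below is one assumed rendering of that interface, together with an illustrative MappedDataset that delegates to its parent as described in the text; the parameter and result types are assumptions.

  // Hedged sketch: one way the RDD interface of Section 4 could look in Scala.
  // Operation names come from the paper; types are assumed for illustration.
  trait RDD[T] {
    def getPartitions: Seq[Int]                              // list of partition IDs
    def getIterator(partition: Int): Iterator[T]             // iterate over one partition
    def getPreferredLocations(partition: Int): Seq[String]   // hosts to prefer for data locality
  }

  // Assumed illustration: a MappedDataset shares partitions and locations with its
  // parent, but its iterator applies the map function to the parent's elements.
  class MappedDataset[A, B](parent: RDD[A], f: A => B) extends RDD[B] {
    def getPartitions = parent.getPartitions
    def getIterator(partition: Int) = parent.getIterator(partition).map(f)
    def getPreferredLocations(partition: Int) = parent.getPreferredLocations(partition)
  }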
Finally, shipping tasks to workers requires shipping closures to them: both the closures used to define a distributed dataset, and closures passed to operations such as reduce. To achieve this, we rely on the fact that Scala closures are Java objects and can be serialized using Java serialization; this is a feature of Scala that makes it relatively straightforward to send a computation to another machine. Scala's built-in closure implementation is not ideal, however, because we have found cases where a closure object references variables in the closure's outer scope that are not actually used in its body. We have filed a bug report about this, but in the meantime, we have solved the issue by performing a static analysis of closure classes' bytecode to detect these unused variables and set the corresponding fields in the closure object to null. We omit the details of this analysis due to lack of space.

Shared Variables: The two types of shared variables in Spark, broadcast variables and accumulators, are implemented using classes with custom serialization formats. When one creates a broadcast variable b with a value v, v is saved to a file in a shared file system. The serialized form of b is a path to this file. When b's value is queried on a worker node, Spark first checks whether v is in a local cache, and reads it from the file system if it isn't. We initially used HDFS to broadcast variables, but we are developing a more efficient streaming broadcast system.

Accumulators are implemented using a different "serialization trick." Each accumulator is given a unique ID when it is created. When the accumulator is saved, its serialized form contains its ID and the "zero" value for its type. On the workers, a separate copy of the accumulator is created for each thread that runs a task using thread-local variables, and is reset to zero when a task begins. After each task runs, the worker sends a message to the driver program containing the updates it made to various accumulators. The driver applies updates from each partition of each operation only once to prevent double-counting when tasks are re-executed due to failures.

Interpreter Integration: Due to lack of space, we only sketch how we have integrated Spark into the Scala interpreter. The Scala interpreter normally operates by compiling a class for each line typed by the user. This class includes a singleton object that contains the variables or functions on that line and runs the line's code in its constructor. For example, if the user types var x = 5 followed by println(x), the interpreter defines a class (say Line1) containing x and causes the second line to compile to println(Line1.getInstance().x). These classes are loaded into the JVM to run each line. To make the interpreter work with Spark, we made two changes:
1. We made the interpreter output the classes it defines to a shared filesystem, from which they can be loaded by the workers using a custom Java class loader.
2. We changed the generated code so that the singleton object for each line references the singleton objects for previous lines directly, rather than going through the static getInstance methods. This allows closures to capture the current state of the singletons they reference whenever they are serialized to be sent to a worker. If we had not done this, then updates to the singleton objects (e.g., a line setting x = 7 in the example above) would not propagate to the workers.
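[Editor's note] To make the second change easier to picture, the sketch below shows an assumed, simplified shape of the interpreter-generated wrapper classes for the var x = 5 / println(x) example in the text; the real generated code differs, and the class names other than Line1 are hypothetical.

  // Hedged sketch: what the interpreter conceptually generates for "var x = 5".
  class Line1 {
    var x = 5
  }
  object Line1 {
    private val instance = new Line1
    def getInstance(): Line1 = instance
  }

  // Standard interpreter: the next line, println(x), reaches x through the static accessor
  // at run time, so a worker's JVM would see its own (stale) copy of Line1.
  class Line2ViaAccessor {
    println(Line1.getInstance().x)
  }

  // Spark's modified interpreter (as described above): the wrapper holds a direct reference
  // to the previous line's singleton, so serializing a closure from this line also carries
  // Line1's current state (e.g. after a later x = 7) to the worker.
  class Line2Direct {
    private val line1 = Line1.getInstance()
    println(line1.x)
  }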
5 Results

Although our implementation of Spark is still at an early stage, we relate the results of three experiments that show its promise as a cluster computing framework.

Logistic Regression: We compared the performance of the logistic regression job in Section 3.2 to an implementation of logistic regression for Hadoop, using a 29 GB dataset on 20 "m1.xlarge" EC2 nodes with 4 cores each. The results are shown in Figure 2. With Hadoop, each iteration takes 127s, because it runs as an independent MapReduce job. With Spark, the first iteration takes 174s (likely due to using Scala instead of Java), but subsequent iterations take only 6s each, because they reuse cached data. This allows the job to run up to 10x faster.

[Figure 2: Logistic regression performance in Hadoop and Spark (running time in seconds vs. number of iterations).]

We have also tried crashing a node while the job was running. In the 10-iteration case, this slows the job down by 50s (21%) on average. The data partitions on the lost node are recomputed and cached in parallel on other nodes, but the recovery time was rather high in the current experiment because we used a high HDFS block size (128 MB), so there were only 12 blocks per node and the recovery process could not utilize all cores in the cluster. Smaller block sizes would yield faster recovery times.

Alternating Least Squares: We have implemented the alternating least squares job in Section 3.3 to measure the benefit of broadcast variables for iterative jobs that copy a shared dataset to multiple nodes. We found that without using broadcast variables, the time to resend the ratings matrix R on each iteration dominated the job's running time. Furthermore, with a naïve implementation of broadcast (using HDFS or NFS), the broadcast time grew linearly with the number of nodes, limiting the scalability of the job. We implemented an application-level multicast system to mitigate this. However, even with fast broadcast, resending R on each iteration is costly. Caching R in memory on the workers using a broadcast variable improved performance by 2.8x in an experiment with 5000 movies and 15000 users on a 30-node EC2 cluster.

Interactive Spark: We used the Spark interpreter to load a 39 GB dump of Wikipedia in memory across 15 "m1.xlarge" EC2 machines and query it interactively. The first time the dataset is queried, it takes roughly 35 seconds, comparable to running a Hadoop job on it. However, subsequent queries take only 0.5 to 1 seconds, even if they scan all the data. This provides a qualitatively different experience, comparable to working with local data.

6 Related Work

Distributed Shared Memory: Spark's resilient distributed datasets can be viewed as an abstraction for distributed shared memory (DSM), which has been studied extensively [20]. RDDs differ from DSM interfaces in two ways. First, RDDs provide a much more restricted programming model, but one that lets datasets be rebuilt efficiently if cluster nodes fail. While some DSM systems achieve fault tolerance through checkpointing [18], Spark reconstructs lost partitions of RDDs using lineage information captured in the RDD objects. This means that only the lost partitions need to be recomputed, and that they can be recomputed in parallel on different nodes, without requiring the program to revert to a checkpoint. In addition, there is no overhead if no nodes fail. Second, RDDs push computation to the data as in MapReduce [11], rather than letting arbitrary nodes access a global address space.

Other systems have also restricted the DSM programming model to improve performance, reliability and programmability. Munin [8] lets programmers annotate variables with the access pattern they will have so as to choose an optimal consistency protocol for them. Linda [13] provides a tuple space programming model that may be implemented in a fault-tolerant fashion. Thor [19] provides an interface to persistent shared objects.

Cluster Computing Frameworks: Spark's parallel operations fit into the MapReduce model [11]. However, they operate on RDDs that can persist across operations. The need to extend MapReduce to support iterative jobs was also recognized by Twister [6, 12], a MapReduce framework that allows long-lived map tasks to keep static data in memory between jobs. However, Twister does not currently implement fault tolerance. Spark's abstraction of resilient distributed datasets is both fault-tolerant and more general than iterative MapReduce. A Spark program can define multiple RDDs and alternate between running operations on them, whereas a Twister program has only one map function and one reduce function. This also makes Spark useful for interactive data analysis, where a user can define several datasets and then query them.

Spark's broadcast variables provide a similar facility to Hadoop's distributed cache [2], which can disseminate a file to all nodes running a particular job. However, broadcast variables can be reused across parallel operations.

Language Integration: Spark's language integration is similar to that of DryadLINQ [25], which uses .NET's support for language integrated queries to capture an expression tree defining a query and run it on a cluster. Unlike DryadLINQ, Spark allows RDDs to persist in memory across parallel operations. In addition, Spark enriches the language integration model by supporting shared variables (broadcast variables and accumulators), implemented using classes with custom serialized forms. We were inspired to use Scala for language integration by SMR [14], a Scala interface for Hadoop that uses closures to define map and reduce tasks. Our contributions over SMR are shared variables and a more robust implementation of closure serialization (described in Section 4).

Finally, IPython [22] is a Python interpreter for scientists that lets users launch computations on a cluster using a fault-tolerant task queue interface or low-level message passing interface. Spark provides a similar interactive interface, but focuses on data-intensive computations.

Lineage: Capturing lineage or provenance information for datasets has long been a research topic in the scientific computing and database fields, for applications such as explaining results, allowing them to be reproduced by others, and recomputing data if a bug is found in a workflow step or if a dataset is lost. We refer the reader to [7], [23] and [9] for surveys of this work.
Spark provides a restricted parallel programming model where fine-grained lineage is inexpensive to capture, so that this information can be used to recompute lost dataset elements.

7 Discussion and Future Work

Spark provides three simple data abstractions for programming clusters: resilient distributed datasets (RDDs), and two restricted types of shared variables, broadcast variables and accumulators. While these abstractions are limited, we have found that they are powerful enough to express several applications that pose challenges for existing cluster computing frameworks, including iterative and interactive computations. Furthermore, we believe that the core idea behind RDDs, of a dataset handle that has enough information to (re)construct the dataset from data available in reliable storage, may prove useful in developing other abstractions for programming clusters.

In future work, we plan to focus on four areas:
1. Formally characterize the properties of RDDs and Spark's other abstractions, and their suitability for various classes of applications and workloads.
2. Enhance the RDD abstraction to allow programmers to trade between storage cost and reconstruction cost.
3. Define new operations to transform RDDs, including a "shuffle" operation that repartitions an RDD by a given key. Such an operation would allow us to implement group-bys and joins.
4. Provide higher-level interactive interfaces on top of the Spark interpreter, such as SQL and R [4] shells.

8 Acknowledgements

We thank Ali Ghodsi for his feedback on this paper. This research was supported by California MICRO, California Discovery, the Natural Sciences and Engineering Research Council of Canada, as well as the following Berkeley RAD Lab sponsors: Sun Microsystems, Google, Microsoft, Amazon, Cisco, Cloudera, eBay, Facebook, Fujitsu, HP, Intel, NetApp, SAP, VMware, and Yahoo!.

References

[1] Apache Hive. http://hadoop.apache.org/hive.
[2] Hadoop Map/Reduce tutorial. http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html.
[3] Logistic regression – Wikipedia. http://en.wikipedia.org/wiki/Logistic_regression.
[4] The R project for statistical computing. http://www.r-project.org.
[5] Scala programming language. http://www.scala-lang.org.
[6] Twister: Iterative MapReduce. http://iterativemapreduce.org.
[7] R. Bose and J. Frew. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys, 37:1–28, 2005.
[8] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In SOSP '91. ACM, 1991.
[9] J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.
[10] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS '06, pages 281–288. MIT Press, 2006.
[11] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[12] J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for data intensive scientific analyses. In ESCIENCE '08, pages 277–284, Washington, DC, USA, 2008. IEEE Computer Society.
[13] D. Gelernter. Generative communication in Linda. ACM Trans. Program. Lang. Syst., 7(1):80–112, 1985.
[14] D. Hall. A scalable language, and a scalable framework. http://www.scala-blogs.org/2008/09/scalable-language-and-scalable.html.
[15] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. Technical Report UCB/EECS-2010-87, EECS Department, University of California, Berkeley, May 2010.
[16] B. Hindman, A. Konwinski, M. Zaharia, and I. Stoica. A common substrate for cluster computing. In Workshop on Hot Topics in Cloud Computing (HotCloud) 2009, 2009.
[17] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys 2007, pages 59–72, 2007.
[18] A.-M. Kermarrec, G. Cabillic, A. Gefflaut, C. Morin, and I. Puaut. A recoverable distributed shared memory integrating coherence and recoverability. In FTCS '95. IEEE Computer Society, 1995.
[19] B. Liskov, A. Adya, M. Castro, S. Ghemawat, R. Gruber, U. Maheshwari, A. C. Myers, M. Day, and L. Shrira. Safe and efficient sharing of persistent objects in Thor. In SIGMOD '96, pages 318–329. ACM, 1996.
[20] B. Nitzberg and V. Lo. Distributed shared memory: a survey of issues and algorithms. Computer, 24(8):52–60, August 1991.
[21] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08. ACM, 2008.
[22] F. Pérez and B. E. Granger. IPython: a system for interactive scientific computing. Comput. Sci. Eng., 9(3):21–29, May 2007.
[23] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Rec., 34(3):31–36, 2005.
[24] H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD '07, pages 1029–1040. ACM, 2007.
[25] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI '08, San Diego, CA, 2008.
[26] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys 2010, April 2010.
[27] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In AAIM '08, pages 337–348, Berlin, Heidelberg, 2008. Springer-Verlag.
