tesser.hadoop

Helps you run a fold on Hadoop!

dsink

(dsink work-dir file-name)

Given a work directory and a string name for this file, builds a dsink for a fold to dump its output to, in [NullWritable FressianWritable] format.

error?

(error? x)

Is this an error object?

execute

(execute graph conf)
(execute graph conf job-name)

Like parkour.graph/execute, but specialized for folds. Takes a parkour graph, a jobconf, and a job name. Executes the job, then returns a sequence of fold results. Job names will be automatically generated if not provided.

fold

(fold conf input workdir fold-var & args)

A simple, all-in-one fold operation. Takes a jobconf, an input dseq, a workdir, a var that points to a fold function, and arguments for the fold function. Runs the fold against the dseq and returns its results. The output dsink is named after the metadata key :tesser.hadoop/output-path on the fold symbol. If that key is absent, the conf key tesser.hadoop.output-path is used, and the final fallback is the fold symbol itself. On error, throws an ex-info.
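
As a rough illustration of how these arguments fit together, the sketch below defines a fold-constructing function and hands its var to fold. The namespace example.hadoop-fold and the names count-matching, run, and needle are hypothetical, and the use of parkour.io.text/dseq as the input is an assumption. The shape of each record depends on the dseq you supply, so the predicate coerces records with str.

```clojure
(ns example.hadoop-fold
  (:require [clojure.string :as string]
            [tesser.core :as t]
            [tesser.hadoop :as h]
            [parkour.io.text :as text]))

(defn count-matching
  "Hypothetical fold constructor: builds a fold that counts the input
  records whose string form contains needle."
  [needle]
  (->> (t/filter #(string/includes? (str %) needle))
       (t/count)))

(defn run
  "Submits the fold to Hadoop. conf is a jobconf and input-path is a text
  file readable by the cluster (both supplied by the caller). Argument
  order follows the signature: conf, input dseq, workdir, fold var, args."
  [conf workdir input-path needle]
  (h/fold conf
          (text/dseq input-path)
          (str workdir "/" (gensym)) ; a fresh work directory per run
          #'count-matching
          needle))
```

The fold function is passed as a var (#'count-matching) rather than as a value, which matches the fold-var parameter in the signature. Because the fold is built from a name plus arguments, the arguments should be plain data.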

fold*

(fold* graph fold-var & args)

Takes a Parkour graph and applies a fold to it. Takes a var pointing to a function that, given args, constructs a fold. Returns a new (unexecuted) graph. The output of this job will be a single-element Fressian structure containing the results of the fold applied to the job’s inputs.

fold-mapper

(fold-mapper fold-name fold-args input)

A generic, stateful Hadoop mapper for applying a fold to a Hadoop dataset. This function returns a mapper for the fold defined by make-fold applied to fold-name and additional args.

fold-reduce-twice

(fold-reduce-twice conf input workdir fold-var & args)

Like fold, but in two stages (jobs). First job uses the number of reduce tasks specified in mapred.reduce.tasks. The second job does nothing in the map step, then uses one reduce task.

fold-reducer

(fold-reducer fold-name fold-args input)

This function returns a Parkour reducer for the fold defined by make-fold applied to fold-name and additional args.

fold-reducer-without-post-combiner

(fold-reducer-without-post-combiner fold-name fold-args input)

Like fold-reducer, but omits :post-combiner so it can be used in reduce tasks that are continued in multiple jobs.

gen-job-name!

(gen-job-name!)

Generates a new job name. Job names start at a random small integer and increment sequentially from there. Job names are printed to stderr when generated.
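
The behavior described above can be approximated with an atom. This is a minimal sketch, not the library's implementation: the names example-job-counter and example-job-name!, the upper bound of 1000, and the "example-job-" prefix are all assumptions made for illustration.

```clojure
;; Hypothetical stand-in for a job name counter: starts at a random
;; small integer (the bound of 1000 is an assumption).
(def example-job-counter (atom (rand-int 1000)))

(defn example-job-name!
  "Increments the counter, builds a job name from it, prints the name
  to stderr, and returns it. The name format is hypothetical."
  []
  (let [job-name (str "example-job-" (swap! example-job-counter inc))]
    (binding [*out* *err*]
      (println "Generated job name:" job-name))
    job-name))
```

Since swap! is atomic, concurrent callers in one JVM each get a distinct number.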

identity-mapper

(identity-mapper records)

Does nothing in the map step.

job-name-counter

partition-randomly

(partition-randomly _ _ nparts)

Partitions map task outputs randomly and uniformly among the reduce tasks.
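
A partitioner with this signature can be sketched in a few lines. The function below is an illustrative assumption, not the library's source: it ignores the key and value and picks a partition index uniformly from 0 (inclusive) to nparts (exclusive). The name example-random-partition is hypothetical.

```clojure
(defn example-random-partition
  "Ignores the key and value and returns a uniformly random partition
  index in the range [0, nparts)."
  [_key _value nparts]
  (rand-int nparts))
```

Random partitioning suits folds because the reduce step does not rely on records with the same key arriving at the same reducer. It only needs the load to be spread evenly.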

print-error

(print-error e)

Print an error to err.

rehydrate-fold

(rehydrate-fold fold-name fold-args)

Takes the name of a function that generates a fold (a symbol) and args for that function, and invokes the function with args to build a fold, which is then compiled and returned.

resolve+

(resolve+ sym)

Resolves a symbol to a var, requiring the namespace if necessary. If the namespace doesn’t exist, throws just like clojure.core/require. If the symbol doesn’t exist after requiring, returns nil.
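
The described behavior can be sketched with clojure.core alone. This is an approximation written for illustration, not the library's source, and the name example-resolve+ is hypothetical.

```clojure
(defn example-resolve+
  "Requires the namespace part of a qualified symbol, if there is one,
  then resolves the symbol. A missing namespace makes require throw.
  A missing var makes resolve return nil."
  [sym]
  (when-let [ns-part (namespace sym)]
    (require (symbol ns-part)))
  (resolve sym))

;; Usage: loads clojure.set on demand and returns the var for union.
(example-resolve+ 'clojure.set/union)
```

Requiring on demand matters on Hadoop because a mapper or reducer may start in a fresh JVM where the namespace that defines the fold has not been loaded yet.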

serialize-error

(serialize-error state input e)

Convert an exception to an error.

set-one-reducer!

(set-one-reducer! conf)

Takes a jobconf, returns the jobconf with mapred.reduce.tasks set to 1.