org.finra.msd.sparkfactory

SparkFactory

object SparkFactory

Linear Supertypes
AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. var conf: SparkConf
  7. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  8. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  9. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  10. def flattenDataFrame(df: DataFrame, delimiter: String): DataFrame

    Flattens a DataFrame to only have a single column that contains the entire original row

    df

    a dataframe which contains one or more columns

    delimiter

    a character which separates the source data

    returns

    flattened dataframe
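
    Example: a minimal sketch, assuming Spark has already been initialized through one of the initialize methods; the query and column names are made up for illustration.

      import org.finra.msd.sparkfactory.SparkFactory

      // build a small two-column DataFrame from the factory's session
      val df = SparkFactory.sparkSession.sql("select 1 as id, 'abc' as name")
      // collapse each row into a single delimited string column
      val flattened = SparkFactory.flattenDataFrame(df, ",")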

  11. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  13. def initializeDataBricks(dataBricksSparkSession: SparkSession): Unit
  14. def initializeSparkContext(): Unit

    The initialize method creates the main spark session which MegaSparkDiff uses. This needs to be called before any operation can be made using MegaSparkDiff. It creates a spark app with the name "megasparkdiff" and enables hive support by default.
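
    Example: a minimal sketch of the expected call order; the comparison step in between is elided.

      import org.finra.msd.sparkfactory.SparkFactory

      SparkFactory.initializeSparkContext() // must come before any other call
      // ... parallelize sources and run comparisons here ...
      SparkFactory.stopSparkContext()       // terminate the session when done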

  15. def initializeSparkLocalMode(numCores: String, logLevel: String, defaultPartitions: String): Unit

    This method should be used to initialize MegaSparkDiff in local mode, in other words anywhere that is not EMR or EC2. The typical use cases are executing a diff on your laptop or workstation, or within a Jenkins build.

    numCores

    sets the number of cores Spark will use. For example, "local[1]" means Spark will use one core only; "local[*]" means Spark will detect how many cores are available and use them all.
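
    Example: a minimal sketch; "WARN" and "100" are illustrative values for the log level and default partitions, not recommendations.

      import org.finra.msd.sparkfactory.SparkFactory

      // use all available cores, warn-level logging, 100 default partitions
      SparkFactory.initializeSparkLocalMode("local[*]", "WARN", "100")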

  16. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  17. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  18. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  19. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  20. def parallelizeCSVSource(filePath: String, tempViewName: String, schemaDef: Option[StructType] = Option.empty, delimiter: Option[String] = Option.apply(",")): AppleTable

    This method creates an AppleTable for data in a CSV file

    filePath

    the relative path to the CSV file

    tempViewName

    the name of the temporary view which gets created for source data

    schemaDef

    optional schema definition for the data. If none is provided, the schema will be inferred by spark

    delimiter

    optional delimiter character for the data. If none is provided, comma (",") will be used as default

    returns

    custom table containing the data to be compared
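
    Example: a minimal sketch; the file path, view name, and schema are hypothetical.

      import org.finra.msd.sparkfactory.SparkFactory
      import org.apache.spark.sql.types.{StringType, StructField, StructType}

      val schema = StructType(Seq(
        StructField("id", StringType),
        StructField("name", StringType)))
      // pipe-delimited file with an explicit schema instead of inference
      val csvTable = SparkFactory.parallelizeCSVSource(
        "data/source.csv", "csv_view",
        schemaDef = Option(schema), delimiter = Option("|"))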

  21. def parallelizeDynamoDBSource(tableName: String, tempViewName: String, firstLevelElementNames: Array[String], delimiter: Option[String] = Option.apply(","), selectColumns: Option[Array[String]] = Option.empty, filter: Option[String] = Option.empty, region: Option[String] = Option.empty, roleArn: Option[String] = Option.empty, readPartitions: Option[String] = Option.empty, maxPartitionBytes: Option[String] = Option.empty, defaultParallelism: Option[String] = Option.empty, targetCapacity: Option[String] = Option.empty, stronglyConsistentReads: Option[String] = Option.empty, bytesPerRCU: Option[String] = Option.empty, filterPushdown: Option[String] = Option.empty, throughput: Option[String] = Option.empty): AppleTable

    This method will create an AppleTable for data in a DynamoDB table.

    The method uses the spark-dynamodb library, to which many of the parameters below are passed; their descriptions are therefore reproduced from the spark-dynamodb README file.

    tableName

    name of DynamoDB table

    tempViewName

    temporary table name for source data

    firstLevelElementNames

    names of the first level elements in the table

    delimiter

    source data separation character

    selectColumns

    list of columns to select from table

    filter

    sql where condition

    region

    region Spark parameter with the following description from the source README: "sets the region where the dynamodb table. Default is environment specific."

    roleArn

    roleArn Spark parameter with the following description from the source README: "sets an IAM role to assume. This allows for access to a DynamoDB in a different account than the Spark cluster. Defaults to the standard role configuration."

    readPartitions

    readPartitions Spark reader parameter with the following description from the source README: "number of partitions to split the initial RDD when loading the data into Spark. Defaults to the size of the DynamoDB table divided into chunks of maxPartitionBytes"

    maxPartitionBytes

    maxPartitionBytes Spark reader parameter with the following description from the source README: "the maximum size of a single input partition. Default 128 MB"

    defaultParallelism

    defaultParallelism Spark reader parameter with the following description from the source README: "the number of input partitions that can be read from DynamoDB simultaneously. Defaults to sparkContext.defaultParallelism"

    targetCapacity

    targetCapacity Spark reader parameter with the following description from the source README: "fraction of provisioned read capacity on the table (or index) to consume for reading. Default 1 (i.e. 100% capacity)."

    stronglyConsistentReads

    stronglyConsistentReads Spark reader parameter with the following description from the source README: "whether or not to use strongly consistent reads. Default false."

    bytesPerRCU

    bytesPerRCU Spark reader parameter with the following description from the source README: "number of bytes that can be read per second with a single Read Capacity Unit. Default 4000 (4 KB). This value is multiplied by two when stronglyConsistentReads=false"

    filterPushdown

    filterPushdown Spark reader parameter with the following description from the source README: "whether or not to use filter pushdown to DynamoDB on scan requests. Default true."

    throughput

    throughput Spark reader parameter with the following description from the source README: "the desired read throughput to use. It overwrites any calculation used by the package. It is intended to be used with tables that are on-demand. Defaults to 100 for on-demand."

    returns

    custom table containing the data to be compared
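
    Example: a minimal sketch; the table name, view name, element names, and region are hypothetical, and all unspecified options fall back to their spark-dynamodb defaults.

      import org.finra.msd.sparkfactory.SparkFactory

      val dynamoTable = SparkFactory.parallelizeDynamoDBSource(
        "my_dynamo_table", "dynamo_view",
        Array("id", "price", "updated_at"),
        region = Option("us-east-1"))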

  22. def parallelizeHiveSource(sqlText: String, tempViewName: String): AppleTable

    This method will create an AppleTable from a query that retrieves data from a hive table. It is assumed that hive connectivity is already enabled in the environment from which this project is run.

    sqlText

    a query to retrieve the data

    tempViewName

    custom table name for source

    returns

    custom table containing the data
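
    Example: a minimal sketch; assumes hive support is available in the session, and the database and table names are hypothetical.

      import org.finra.msd.sparkfactory.SparkFactory

      val hiveTable = SparkFactory.parallelizeHiveSource(
        "select id, amount from mydb.trades", "hive_view")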

  23. def parallelizeJDBCSource(driverClassName: String, jdbcUrl: String, username: String, password: String, sqlQuery: String, tempViewName: String): AppleTable

    This method will create an AppleTable from a query that retrieves data from a database accessed through a JDBC connection; passes in "," as the default delimiter

    driverClassName

    JDBC driver name

    jdbcUrl

    JDBC URL

    username

    Username for database connection

    password

    Password for database connection

    sqlQuery

    Query to retrieve the desired data from database

    tempViewName

    temporary table name for source data

    returns

    custom table containing the data to be compared
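
    Example: a minimal sketch; the driver, URL, and credentials are placeholders. The parenthesized, aliased subquery form is the usual pattern when passing query text to Spark's JDBC reader.

      import org.finra.msd.sparkfactory.SparkFactory

      val jdbcTable = SparkFactory.parallelizeJDBCSource(
        "org.hsqldb.jdbc.JDBCDriver",
        "jdbc:hsqldb:hsql://127.0.0.1:9001/testDb",
        "user", "password",
        "(select * from Persons1) a",
        "jdbc_view")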

  24. def parallelizeJDBCSource(driverClassName: String, jdbcUrl: String, username: String, password: String, sqlQuery: String, tempViewName: String, delimiter: Option[String], partitionColumn: String, lowerBound: String, upperBound: String, numPartitions: String): AppleTable

    This method will create an AppleTable from a query that retrieves data from a database accessed through a JDBC connection.

    driverClassName

    JDBC driver name

    jdbcUrl

    JDBC URL

    username

    Username for database connection

    password

    Password for database connection

    sqlQuery

    Query to retrieve the desired data from database

    tempViewName

    temporary table name for source data

    delimiter

    source data separation character

    partitionColumn

    column to partition queries on

    lowerBound

    lower bound of partition column values

    upperBound

    upper bound of partition column values

    numPartitions

    number of total queries to execute

    returns

    custom table containing the data to be compared
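
    Example: a minimal sketch; the connection details, partition column, and bounds are hypothetical. The read is split into four parallel queries partitioned on id between the given bounds.

      import org.finra.msd.sparkfactory.SparkFactory

      val partitionedTable = SparkFactory.parallelizeJDBCSource(
        "org.postgresql.Driver",
        "jdbc:postgresql://dbhost:5432/mydb",
        "user", "password",
        "(select * from trades) t", "trades_view",
        Option(","), "id", "1", "1000000", "4")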

  25. def parallelizeJDBCSource(driverClassName: String, jdbcUrl: String, username: String, password: String, sqlQuery: String, tempViewName: String, delimiter: Option[String]): AppleTable

    This method will create an AppleTable from a query that retrieves data from a database accessed through a JDBC connection.

    driverClassName

    JDBC driver name

    jdbcUrl

    JDBC URL

    username

    Username for database connection

    password

    Password for database connection

    sqlQuery

    Query to retrieve the desired data from database

    tempViewName

    temporary table name for source data

    delimiter

    source data separation character

    returns

    custom table containing the data to be compared

  26. def parallelizeJSONSource(jsonFileLocation: String, tempViewName: String, firstLevelElementNames: Array[String], delimiter: Option[String] = Option.apply(",")): AppleTable

    This method will create an AppleTable for data in a JSON file.

    jsonFileLocation

    path of a json file containing the data to be compared

    tempViewName

    temporary table name for source data

    firstLevelElementNames

    Names of the first level elements in the file

    delimiter

    source data separation character

    returns

    custom table containing the data to be compared
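
    Example: a minimal sketch; the file location and element names are hypothetical.

      import org.finra.msd.sparkfactory.SparkFactory

      val jsonTable = SparkFactory.parallelizeJSONSource(
        "data/source.json", "json_view",
        Array("id", "name", "address"))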

  27. def parallelizeTextFile(textFileLocation: String): DataFrame

    This method will create a DataFrame with a single column called "field1" from a text file. The file can be on the local machine, on HDFS (with a URL like "hdfs://nn1home:8020/input/war-and-peace.txt"), or on S3 (with a URL like "s3n://myBucket/myFile1.log").

    textFileLocation

    path of a flat file containing the data to be compared

    returns

    a dataframe converted from a flat file

  28. def parallelizeTextSource(textFileLocation: String, tempViewName: String): AppleTable

    This method will create a DataFrame with a single column called "field1" from a text file. The file can be on the local machine, on HDFS (with a URL like "hdfs://nn1home:8020/input/war-and-peace.txt"), or on S3 (with a URL like "s3n://myBucket/myFile1.log").

    textFileLocation

    path of a flat file containing the data to be compared

    tempViewName

    temporary table name for source data

    returns

    custom table containing the data to be compared
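
    Example: a minimal sketch; the S3 bucket and file are hypothetical.

      import org.finra.msd.sparkfactory.SparkFactory

      val textTable = SparkFactory.parallelizeTextSource(
        "s3n://myBucket/myFile1.log", "text_view")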

  29. def simpleTableToSimpleJSONFormatTable(df: DataFrame): DataFrame

    Transforms a single-level DataFrame into a JSON format DataFrame

    df

    single level DataFrame

    returns

    simplified JSON format DataFrame

  30. var sparkSession: SparkSession
  31. def stopSparkContext(): Unit

    Terminates the current Spark session

  32. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  33. def toString(): String
    Definition Classes
    AnyRef → Any
  34. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  35. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  36. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  37. object sparkImplicits extends SQLImplicits
