org.finra.msd.sparkfactory

SparkFactory

object SparkFactory

Linear Supertypes
AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. var conf: SparkConf
  7. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  8. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  9. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  10. def flattenDataFrame(df: DataFrame, delimiter: String): DataFrame

    Flattens a DataFrame to only have a single column that contains the entire original row

    df

    a dataframe which contains one or more columns

    delimiter

    a character which separates the source data

    returns

    flattened dataframe
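
    Example: a minimal sketch, assuming Spark has already been initialized through one of the initialize methods; the query and column names are made up for illustration.

      import org.finra.msd.sparkfactory.SparkFactory

      // build a small two-column DataFrame from the factory's session
      val df = SparkFactory.sparkSession.sql("select 1 as id, 'abc' as name")
      // collapse each row into a single delimited string column
      val flattened = SparkFactory.flattenDataFrame(df, ",")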

  11. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  13. def initializeDataBricks(dataBricksSparkSession: SparkSession): Unit
  14. def initializeSparkContext(): Unit

    The initialize method creates the main spark session which MegaSparkDiff uses. This needs to be called before any operation can be made using MegaSparkDiff. It creates a spark app with the name "megasparkdiff" and enables hive support by default.
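
    Example: a minimal sketch of the expected call order; the comparison step in between is elided.

      import org.finra.msd.sparkfactory.SparkFactory

      SparkFactory.initializeSparkContext() // must come before any other call
      // ... parallelize sources and run comparisons here ...
      SparkFactory.stopSparkContext()       // terminate the session when done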

  15. def initializeSparkLocalMode(numCores: String, logLevel: String, defaultPartitions: String): Unit

    This method should be used to initialize MegaSparkDiff in local mode, in other words anywhere that is not EMR or EC2. The typical use cases are executing a diff on your laptop or workstation, or within a Jenkins build.

    numCores

    sets the number of cores Spark will use. For example, "local[1]" means Spark will use one core only; "local[*]" means Spark will detect how many cores are available and use them all.
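
    Example: a minimal sketch; "WARN" and "100" are illustrative values for the log level and default partitions, not recommendations.

      import org.finra.msd.sparkfactory.SparkFactory

      // use all available cores, warn-level logging, 100 default partitions
      SparkFactory.initializeSparkLocalMode("local[*]", "WARN", "100")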

  16. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  17. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  18. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  19. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  20. def parallelizeCSVSource(filePath: String, tempViewName: String, schemaDef: Option[StructType] = Option.empty, delimiter: Option[String] = Option.apply(",")): AppleTable

    This method creates an AppleTable for data in a CSV file

    filePath

    the relative path to the CSV file

    tempViewName

    the name of the temporary view which gets created for source data

    schemaDef

    optional schema definition for the data. If none is provided, the schema will be inferred by spark

    delimiter

    optional delimiter character for the data. If none is provided, comma (",") will be used as default

    returns

    custom table containing the data to be compared
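
    Example: a minimal sketch; the file path, view name, and schema are hypothetical.

      import org.finra.msd.sparkfactory.SparkFactory
      import org.apache.spark.sql.types.{StringType, StructField, StructType}

      val schema = StructType(Seq(
        StructField("id", StringType),
        StructField("name", StringType)))
      // pipe-delimited file with an explicit schema instead of inference
      val csvTable = SparkFactory.parallelizeCSVSource(
        "data/source.csv", "csv_view",
        schemaDef = Option(schema), delimiter = Option("|"))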

  21. def parallelizeDynamoDBSource(tableName: String, tempViewName: String, firstLevelElementNames: Array[String], delimiter: Option[String] = Option.apply(","), selectColumns: Option[Array[String]] = Option.empty, filter: Option[String] = Option.empty, region: Option[String] = Option.empty, roleArn: Option[String] = Option.empty, readPartitions: Option[String] = Option.empty, maxPartitionBytes: Option[String] = Option.empty, defaultParallelism: Option[String] = Option.empty, targetCapacity: Option[String] = Option.empty, stronglyConsistentReads: Option[String] = Option.empty, bytesPerRCU: Option[String] = Option.empty, filterPushdown: Option[String] = Option.empty, throughput: Option[String] = Option.empty): AppleTable

    This method will create an AppleTable for data in a DynamoDB table.

    The method uses the spark-dynamodb library, to which many of the parameters below are passed; their descriptions are therefore reproduced from the spark-dynamodb README file.

    tableName

    name of DynamoDB table

    tempViewName

    temporary table name for source data

    firstLevelElementNames

    names of the first level elements in the table

    delimiter

    source data separation character

    selectColumns

    list of columns to select from table

    filter

    sql where condition

    region

    region Spark parameter with the following description from the source README: "sets the region where the dynamodb table. Default is environment specific."

    roleArn

    roleArn Spark parameter with the following description from the source README: "sets an IAM role to assume. This allows for access to a DynamoDB in a different account than the Spark cluster. Defaults to the standard role configuration."

    readPartitions

    readPartitions Spark reader parameter with the following description from the source README: "number of partitions to split the initial RDD when loading the data into Spark. Defaults to the size of the DynamoDB table divided into chunks of maxPartitionBytes"

    maxPartitionBytes

    maxPartitionBytes Spark reader parameter with the following description from the source README: "the maximum size of a single input partition. Default 128 MB"

    defaultParallelism

    defaultParallelism Spark reader parameter with the following description from the source README: "the number of input partitions that can be read from DynamoDB simultaneously. Defaults to sparkContext.defaultParallelism"

    targetCapacity

    targetCapacity Spark reader parameter with the following description from the source README: "fraction of provisioned read capacity on the table (or index) to consume for reading. Default 1 (i.e. 100% capacity)."

    stronglyConsistentReads

    stronglyConsistentReads Spark reader parameter with the following description from the source README: "whether or not to use strongly consistent reads. Default false."

    bytesPerRCU

    bytesPerRCU Spark reader parameter with the following description from the source README: "number of bytes that can be read per second with a single Read Capacity Unit. Default 4000 (4 KB). This value is multiplied by two when stronglyConsistentReads=false"

    filterPushdown

    filterPushdown Spark reader parameter with the following description from the source README: "whether or not to use filter pushdown to DynamoDB on scan requests. Default true."

    throughput

    throughput Spark reader parameter with the following description from the source README: "the desired read throughput to use. It overwrites any calculation used by the package. It is intended to be used with tables that are on-demand. Defaults to 100 for on-demand."

    returns

    custom table containing the data to be compared
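
    Example: a minimal sketch; the table name, view name, element names, and region are hypothetical, and all unspecified options fall back to their spark-dynamodb defaults.

      import org.finra.msd.sparkfactory.SparkFactory

      val dynamoTable = SparkFactory.parallelizeDynamoDBSource(
        "my_dynamo_table", "dynamo_view",
        Array("id", "price", "updated_at"),
        region = Option("us-east-1"))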

  22. def parallelizeHiveSource(sqlText: String, tempViewName: String): AppleTable

    This method will create an AppleTable from a query that retrieves data from a hive table. It is assumed that hive connectivity is already enabled in the environment from which this project is run.

    sqlText

    a query to retrieve the data

    tempViewName

    custom table name for source

    returns

    custom table containing the data
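
    Example: a minimal sketch; assumes hive support is available in the session, and the database and table names are hypothetical.

      import org.finra.msd.sparkfactory.SparkFactory

      val hiveTable = SparkFactory.parallelizeHiveSource(
        "select id, amount from mydb.trades", "hive_view")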

  23. def parallelizeJDBCSource(driverClassName: String, jdbcUrl: String, username: String, password: String, sqlQuery: String, tempViewName: String): AppleTable

    This method will create an AppleTable from a query that retrieves data from a database accessed through a JDBC connection; passes in "," as the default delimiter

    driverClassName

    JDBC driver name

    jdbcUrl

    JDBC URL

    username

    Username for database connection

    password

    Password for database connection

    sqlQuery

    Query to retrieve the desired data from database

    tempViewName

    temporary table name for source data

    returns

    custom table containing the data to be compared
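
    Example: a minimal sketch; the driver, URL, and credentials are placeholders. The parenthesized, aliased subquery form is the usual pattern when passing query text to Spark's JDBC reader.

      import org.finra.msd.sparkfactory.SparkFactory

      val jdbcTable = SparkFactory.parallelizeJDBCSource(
        "org.hsqldb.jdbc.JDBCDriver",
        "jdbc:hsqldb:hsql://127.0.0.1:9001/testDb",
        "user", "password",
        "(select * from Persons1) a",
        "jdbc_view")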

  24. def parallelizeJDBCSource(driverClassName: String, jdbcUrl: String, username: String, password: String, sqlQuery: String, tempViewName: String, delimiter: Option[String], partitionColumn: String, lowerBound: String, upperBound: String, numPartitions: String): AppleTable

    This method will create an AppleTable from a query that retrieves data from a database accessed through a JDBC connection.

    driverClassName

    JDBC driver name

    jdbcUrl

    JDBC URL

    username

    Username for database connection

    password

    Password for database connection

    sqlQuery

    Query to retrieve the desired data from database

    tempViewName

    temporary table name for source data

    delimiter

    source data separation character

    partitionColumn

    column to partition queries on

    lowerBound

    lower bound of partition column values

    upperBound

    upper bound of partition column values

    numPartitions

    number of total queries to execute

    returns

    custom table containing the data to be compared
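
    Example: a minimal sketch; the connection details, partition column, and bounds are hypothetical. The read is split into four parallel queries partitioned on id between the given bounds.

      import org.finra.msd.sparkfactory.SparkFactory

      val partitionedTable = SparkFactory.parallelizeJDBCSource(
        "org.postgresql.Driver",
        "jdbc:postgresql://dbhost:5432/mydb",
        "user", "password",
        "(select * from trades) t", "trades_view",
        Option(","), "id", "1", "1000000", "4")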

  25. def parallelizeJDBCSource(driverClassName: String, jdbcUrl: String, username: String, password: String, sqlQuery: String, tempViewName: String, delimiter: Option[String]): AppleTable

    This method will create an AppleTable from a query that retrieves data from a database accessed through a JDBC connection.

    driverClassName

    JDBC driver name

    jdbcUrl

    JDBC URL

    username

    Username for database connection

    password

    Password for database connection

    sqlQuery

    Query to retrieve the desired data from database

    tempViewName

    temporary table name for source data

    delimiter

    source data separation character

    returns

    custom table containing the data to be compared

  26. def parallelizeJSONSource(jsonFileLocation: String, tempViewName: String, firstLevelElementNames: Array[String], delimiter: Option[String] = Option.apply(",")): AppleTable

    This method will create an AppleTable for data in a JSON file.

    jsonFileLocation

    path of a json file containing the data to be compared

    tempViewName

    temporary table name for source data

    firstLevelElementNames

    Names of the first level elements in the file

    delimiter

    source data separation character

    returns

    custom table containing the data to be compared
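
    Example: a minimal sketch; the file location and element names are hypothetical.

      import org.finra.msd.sparkfactory.SparkFactory

      val jsonTable = SparkFactory.parallelizeJSONSource(
        "data/source.json", "json_view",
        Array("id", "name", "address"))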

  27. def parallelizeTextFile(textFileLocation: String): DataFrame

    This method will create a DataFrame with a single column called "field1" from a text file. The file can be on the local machine, on HDFS (with a URL like "hdfs://nn1home:8020/input/war-and-peace.txt"), or on S3 (with a URL like "s3n://myBucket/myFile1.log").

    textFileLocation

    path of a flat file containing the data to be compared

    returns

    a dataframe converted from a flat file

  28. def parallelizeTextSource(textFileLocation: String, tempViewName: String): AppleTable

    This method will create a DataFrame with a single column called "field1" from a text file. The file can be on the local machine, on HDFS (with a URL like "hdfs://nn1home:8020/input/war-and-peace.txt"), or on S3 (with a URL like "s3n://myBucket/myFile1.log").

    textFileLocation

    path of a flat file containing the data to be compared

    tempViewName

    temporary table name for source data

    returns

    custom table containing the data to be compared
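
    Example: a minimal sketch; the S3 bucket and file are hypothetical.

      import org.finra.msd.sparkfactory.SparkFactory

      val textTable = SparkFactory.parallelizeTextSource(
        "s3n://myBucket/myFile1.log", "text_view")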

  29. def simpleTableToSimpleJSONFormatTable(df: DataFrame): DataFrame

    Transforms a single-level DataFrame into a JSON format DataFrame

    df

    single level DataFrame

    returns

    simplified JSON format DataFrame

  30. var sparkSession: SparkSession
  31. def stopSparkContext(): Unit

    Terminates the current Spark session

  32. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  33. def toString(): String
    Definition Classes
    AnyRef → Any
  34. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  35. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  36. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  37. object sparkImplicits extends SQLImplicits
