Scala Processor
Scala is a general-purpose programming language built on the Java virtual machine.
It can interact with data stored in a distributed manner and can be used to process and analyze big data.
The Scala processor can be used to write custom code in Scala.
Please email Gathr Support to enable the Scala processor.
Several code snippets are available in the application; they are explained in this topic to get you started with the Scala processor.
Processor Configuration
Configure the processor parameters as explained below.
Package Name
Name of the package for the Scala code class.
Class Name
Name of the Class for the Scala code.
Imports
Import statements for the Scala code.
Input Source
Input source on which the Scala code operates.
Scala Code
Scala code that performs operations on the input JSON RDD object.
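Taken together, these fields assemble into a compiled Scala class. The exact wrapper Gathr generates is not documented here; the sketch below is only an illustration, with hypothetical values for Package Name, Class Name, Imports, and Scala Code (the method name is also illustrative), of how the pieces fit together.

package com.example.custom // Package Name (hypothetical)

// Imports
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._

// Class Name (hypothetical)
class MyScalaProcessor {
  // Scala Code: operates on the incoming dataset and returns the result
  def process(dataset: Dataset[_]) = dataset.withColumn("processed", lit(true))
}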
Ask AI Assistant
Use the AI assistant feature to simplify the creation of Scala queries.
It allows you to generate complex Scala queries effortlessly, using natural language inputs as your guide.
Describe your desired expression in plain, conversational language. The AI assistant will understand your instructions and transform them into working Scala code.
Tailor queries to your specific requirements, whether it’s for data transformation, filtering, calculations, or any other processing task.
Note: Press Ctrl + Space to list input columns and Ctrl + Enter to submit your request.
Input Example:
Select those records whose last_login_date is less than 60 days from current_date.
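For a request like this, the assistant might produce a filter along the following lines (an illustration only; the generated code depends on your input schema and the wording of your request):

import org.apache.spark.sql.functions._
// Keep records whose last_login_date falls within the last 60 days
dataset.filter(datediff(current_date(), col("last_login_date")) < 60)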
Jar Upload
A jar file is generated when you build your Scala code.
Here, you can upload third-party jars so that their APIs can be used in the Scala code by adding the corresponding import statements.
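For example, assuming a hypothetical utility jar that provides a class com.example.text.Normalizer (named here only for illustration), you could reference it as follows:

// In the Imports field (hypothetical class from the uploaded jar):
import com.example.text.Normalizer
import org.apache.spark.sql.functions.{col, udf}

// In the Scala Code field:
val normalize = udf((s: String) => Normalizer.normalize(s))
dataset.withColumn("Surname_Normalized", normalize(col("Surname")))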
Notes
Optionally, enter notes in the Notes tab and save the configuration.
Code Snippets
Described below are some sample Scala code use cases:
Add a new column with a constant value to an existing dataframe
Description
This script demonstrates how to add a column with a constant value.
In the sample code given below, a column Constant_Column with value Constant_Value is added to an existing dataframe.
Sample Code
// Obtain the SparkSession from the incoming dataset
val sparkSession = dataset.sparkSession
import sparkSession.implicits._
import org.apache.spark.sql.functions._

// lit() wraps a literal value so it can be used as a column expression
dataset.withColumn("Constant_Column", lit("Constant_Value"))
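lit is not limited to strings; any supported literal type works. For instance, a numeric flag column (the column name Is_Active is hypothetical) could be added the same way:

// A numeric constant works the same way as a string constant
dataset.withColumn("Is_Active", lit(1))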
Example
Input Data Set
| CustomerId | Surname | CreditScore | Geography | Age | Balance | EstSalary |
|---|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | France | 34 | 0 | 6534.18 |
| 15604348 | Allard | 710 | Spain | 22 | 0 | 99645.04 |
| 15693683 | Yuille | 814 | Germany | 29 | 97086.4 | 197276.13 |
| 15738721 | Graham | 773 | Spain | 41 | 102827.44 | 64595.25 |
Output Data Set
| CustomerId | Surname | CreditScore | Geography | Age | Balance | EstSalary | Constant_Column |
|---|---|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | France | 34 | 0 | 6534.18 | Constant_Value |
| 15604348 | Allard | 710 | Spain | 22 | 0 | 99645.04 | Constant_Value |
| 15693683 | Yuille | 814 | Germany | 29 | 97086.4 | 197276.13 | Constant_Value |
| 15738721 | Graham | 773 | Spain | 41 | 102827.44 | 64595.25 | Constant_Value |
Add a new column with random values to an existing dataframe
Description
This script demonstrates how to add a column with random values.
Here, a column Random_Column is added with random values in the range [0, 1), as shown in the example output below.
Sample Code
// Obtain the SparkSession from the incoming dataset
val sparkSession = dataset.sparkSession
import sparkSession.implicits._
import org.apache.spark.sql.functions.rand

// rand() generates a uniformly distributed random double in [0, 1) per row
dataset.withColumn("Random_Column", rand())
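If you need reproducible values across runs, rand also accepts a seed:

// A fixed seed makes the output reproducible for the same data and partitioning
dataset.withColumn("Random_Column", rand(42))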
Example
Input Data Set
| CustomerId | Surname | CreditScore | Geography | Age | Balance | EstSalary |
|---|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | France | 34 | 0 | 6534.18 |
| 15604348 | Allard | 710 | Spain | 22 | 0 | 99645.04 |
| 15693683 | Yuille | 814 | Germany | 29 | 97086.4 | 197276.13 |
| 15738721 | Graham | 773 | Spain | 41 | 102827.44 | 64595.25 |
Output Data Set
| CustomerId | Surname | CreditScore | Geography | Age | Balance | EstSalary | Random_Column |
|---|---|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | France | 34 | 0 | 6534.18 | 0.0241309661 |
| 15604348 | Allard | 710 | Spain | 22 | 0 | 99645.04 | 0.5138384557 |
| 15693683 | Yuille | 814 | Germany | 29 | 97086.4 | 197276.13 | 0.2652246569 |
| 15738721 | Graham | 773 | Spain | 41 | 102827.44 | 64595.25 | 0.8454138247 |
Add a new column using an expression with existing columns
Description
This script demonstrates how to add a new column derived from existing columns.
Here, a column Transformed_Column is added by multiplying column EstimatedSalary by column Tenure.
Sample Code
// Obtain the SparkSession from the incoming dataset
val sparkSession = dataset.sparkSession
import sparkSession.implicits._
import org.apache.spark.sql.functions._

// Multiply the two columns row by row to derive the new column
dataset.withColumn("Transformed_Column", col("EstimatedSalary") * col("Tenure"))
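The same transformation can also be written as a SQL expression string with expr, which can be convenient for longer formulas:

import org.apache.spark.sql.functions.expr
// Equivalent to col("EstimatedSalary") * col("Tenure")
dataset.withColumn("Transformed_Column", expr("EstimatedSalary * Tenure"))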
Example
Input Data Set
| CustomerId | Surname | CrScore | Age | Tenure | Balance | EstimatedSalary |
|---|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | 34 | 9 | 0 | 6534.18 |
| 15604348 | Allard | 710 | 22 | 8 | 0 | 99645.04 |
| 15693683 | Yuille | 814 | 29 | 8 | 97086.4 | 197276.13 |
| 15738721 | Graham | 773 | 41 | 9 | 102827.44 | 64595.25 |
Output Data Set
| CustomerId | Surname | CrScore | Age | Tenure | Balance | EstimatedSalary | Transformed_Column |
|---|---|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | 34 | 9 | 0 | 6534.18 | 58807.62 |
| 15604348 | Allard | 710 | 22 | 8 | 0 | 99645.04 | 797160.32 |
| 15693683 | Yuille | 814 | 29 | 8 | 97086.4 | 197276.13 | 1578209.04 |
| 15738721 | Graham | 773 | 41 | 9 | 102827.44 | 64595.25 | 581357.25 |
Transform an existing column
Description
This script demonstrates how to transform an existing column in place.
Here, the values of column Balance are rounded off and converted to integers.
Sample Code
// Obtain the SparkSession from the incoming dataset
val sparkSession = dataset.sparkSession
import sparkSession.implicits._
import org.apache.spark.sql.functions._

// Reusing the existing column name replaces the original values in place
dataset.withColumn("Balance", round(col("Balance")).cast("int"))
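round also accepts a scale argument if you want to keep some decimal places instead of converting to an integer; for example, rounding to two decimals:

// Round to two decimal places instead of casting to an integer
dataset.withColumn("Balance", round(col("Balance"), 2))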
Example
Input Data Set
| CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance |
|---|---|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | France | Male | 34 | 9 | 0 |
| 15604348 | Allard | 710 | Spain | Male | 22 | 8 | 0 |
| 15693683 | Yuille | 814 | Germany | Male | 29 | 8 | 97086.4 |
| 15738721 | Graham | 773 | Spain | Male | 41 | 9 | 102827.44 |
Output Data Set
| CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance |
|---|---|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | France | Male | 34 | 9 | 0 |
| 15604348 | Allard | 710 | Spain | Male | 22 | 8 | 0 |
| 15693683 | Yuille | 814 | Germany | Male | 29 | 8 | 97086 |
| 15738721 | Graham | 773 | Spain | Male | 41 | 9 | 102827 |
Filter data on the basis of a condition
Description
This script demonstrates how to filter data on the basis of a condition.
Here, customers with Age greater than 30 are selected.
Sample Code
// Obtain the SparkSession from the incoming dataset
val sparkSession = dataset.sparkSession
import sparkSession.implicits._

// The $ syntax (enabled by the implicits import) refers to a column by name
dataset.filter($"Age" > 30)
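Conditions can be combined with the usual Boolean operators. For example, to select customers older than 30 who are located in Spain (column names taken from the example below):

// && combines predicates; === is Spark's column equality operator
dataset.filter($"Age" > 30 && $"Geography" === "Spain")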
Example
Input Data Set
| CustomerId | Surname | CreditScore | Geography | Gender | Age |
|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | France | Male | 34 |
| 15604348 | Allard | 710 | Spain | Male | 22 |
| 15693683 | Yuille | 814 | Germany | Male | 29 |
| 15738721 | Graham | 773 | Spain | Male | 41 |
Output Data Set
| CustomerId | Surname | CreditScore | Geography | Gender | Age |
|---|---|---|---|---|---|
| 15633059 | Fanucci | 413 | France | Male | 34 |
| 15738721 | Graham | 773 | Spain | Male | 41 |