Write a pyspark dataframe query to find all duplicate emails | IBM Interview Question |

Published: 09 October 2024
on channel: GeekCoders


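from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse the active session or create one
data = [(1, "[email protected]"), (2, "[email protected]"), (3, "[email protected]")]
schema = "ID int, email string"
df = spark.createDataFrame(data, schema)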

Course Link:
https://www.geekcoders.co.in/courses/...

This code does the following:

Imports necessary libraries and initializes a SparkSession.
Creates a sample DataFrame with "ID" and "email" columns, where "email" holds the email addresses.
Groups the DataFrame by the "email" column.
Counts the occurrences of each email address.
Filters the result to include only rows with a count greater than 1 (i.e., the duplicates).
Finally, it shows the duplicate email addresses.
You can replace the sample DataFrame (df) with your actual DataFrame if you have one. This code will help you identify duplicate email addresses within your DataFrame.

#pysparkinterview #pysparkforbeginners