tagged [apache-spark-sql]
Overwrite specific partitions in spark dataframe write method
I want to overwrite specific partitions instead of all of them in Spark. I am trying the following command, where df is a dataframe holding the incre...
- Modified 15 Sep at 10:3
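The usual approach is dynamic partition overwrite, available since Spark 2.3. A minimal sketch; the path and the `dt` partition column are illustrative, not from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Only partitions present in the incoming DataFrame are replaced,
# instead of the whole table being truncated first.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Toy incremental data; "dt" is the partition column.
df = spark.createDataFrame([(1, "2024-01-02"), (2, "2024-01-02")], ["id", "dt"])
df.write.mode("overwrite").partitionBy("dt").parquet("/tmp/events")
```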
Select columns in PySpark dataframe
I am looking for a way to select columns of my dataframe in PySpark. For the first row, I know I can use `df.first()`, but I am not sure about columns, given that they do...
- Modified 15 Feb at 14:34
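A small sketch of the common ways to select columns; the toy DataFrame and column names are made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a", 10), (2, "b", 20)], ["id", "name", "value"])

df.select("id", "name").show()          # by name
df.select(df.columns[:2]).show()        # by position, via the columns list
df.select((F.col("value") * 2).alias("double_value")).show()  # as an expression
```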
Join two data frames, select all columns from one and some columns from the other
Let's say I have a Spark data frame `df1`, with several columns (among which the column `id`), and a data frame `df2` wit...
- Modified 25 Dec at 16:27
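One way to do it is to alias both frames and use `"alias.*"` in the select; a sketch with made-up columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "a"])
df2 = spark.createDataFrame([(1, "p", 7), (2, "q", 8)], ["id", "b", "c"])

out = (df1.alias("l")
       .join(df2.alias("r"), F.col("l.id") == F.col("r.id"))
       .select("l.*", F.col("r.b")))   # everything from df1, only b from df2
out.show()
```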
How to count unique ID after groupBy in pyspark
I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year. The problem that I discove...
- Modified 17 Feb at 16:44
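`countDistinct` inside `agg` counts each student once per year; a sketch with toy data and assumed column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
students = spark.createDataFrame(
    [(2023, 1), (2023, 1), (2023, 2), (2024, 3)], ["year", "student_id"])

# count("student_id") would count the duplicated row twice; countDistinct does not
students.groupBy("year").agg(F.countDistinct("student_id").alias("n_students")).show()
```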
Filtering a spark dataframe based on date
I have a dataframe of dates and I want to select dates before a certain period. I have tried the following with no luck: `data.filter(data("date") ...`
- Modified 1 Dec at 11:25
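Assuming the column is (or is cast to) a date type, a plain comparison in `filter` works; the column name and cutoff below are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([("2015-01-10",), ("2015-07-01",)], ["date"]) \
            .withColumn("date", F.to_date("date"))

data.filter(F.col("date") < "2015-03-14").show()   # rows strictly before the cutoff
```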
How do I add a new column to a Spark DataFrame (using PySpark)?
I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column. I've tried the following without any success: `typ...`
- Modified 5 Jan at 01:51
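`withColumn` expects a `Column` expression, which is the usual stumbling block; a minimal sketch with made-up columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["id"])

df = df.withColumn("id_plus_one", F.col("id") + 1)   # derived from an existing column
df = df.withColumn("source", F.lit("batch_a"))       # constant value needs lit()
df.show()
```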
Fetching distinct values on a column using Spark DataFrame
Using Spark 1.6.1, I need to fetch distinct values of a column and then perform some specific transformation on top of it. The column ...
- Modified 15 Sep at 10:11
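A sketch: `select(...).distinct()` keeps the work distributed, and the values only come back to the driver if explicitly collected:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["k", "v"])

distinct_k = df.select("k").distinct()            # still a distributed DataFrame
values = [r["k"] for r in distinct_k.collect()]   # only if the result is small
print(values)
```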
How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?
How to find count of Null and Nan values for each column in a PySpark dataframe efficiently? I need a dataframe with the count of nan/null values for e...
- Modified 20 Apr at 11:3
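A common pattern is a single aggregate pass with `count(when(...))` per column; `isnan` only applies to float/double columns, so the sketch below branches on dtype:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, None), (float("nan"), "x")], ["a", "b"])

exprs = [
    F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c)
    if t in ("double", "float")
    else F.count(F.when(F.col(c).isNull(), c)).alias(c)
    for c, t in df.dtypes
]
df.select(exprs).show()   # one row: null/NaN count per column
```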
Filter Pyspark dataframe column with None value
I'm trying to filter a PySpark dataframe that has `None` as a row value, and I can filter correctly with a string value: `df[d...`
- Modified 5 Jan at 06:30
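Equality against `None` never matches in Spark SQL; `isNull`/`isNotNull` are the way to go. A sketch with made-up column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (2, "x")], ["id", "status"])

df.filter(F.col("status").isNull()).show()      # rows where the value is None
df.filter(F.col("status").isNotNull()).show()   # rows where it is not
# df.filter(df.status == None) compares against SQL NULL and returns no rows
```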
multiple conditions for filter in spark data frames
I have a data frame with four fields. One of the field names is Status, and I am trying to use an OR condition in .filter on a dataframe. I tried bel...
- Modified 15 Sep at 10:8
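The conditions have to be combined with `|` / `&` (not `or` / `and`) and each side parenthesized; a PySpark sketch with assumed values:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("open", 1), ("closed", 2), ("new", 3)], ["Status", "n"])

df.filter((F.col("Status") == "open") | (F.col("Status") == "closed")).show()
# Equivalent SQL-string form:
df.filter("Status = 'open' OR Status = 'closed'").show()
```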
Filtering a pyspark dataframe using isin by exclusion
I am trying to get all rows within a dataframe where a column's value is not within a list (so filtering by exclusion). As an example: I get the da...
- Modified 21 Jan at 14:22
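Negating `isin` with `~` does the exclusion; a minimal sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["k"])

df.filter(~F.col("k").isin(["a", "b"])).show()   # keeps only rows outside the list
```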
Converting Pandas dataframe into Spark dataframe error
I'm trying to convert a Pandas DF into a Spark one. DF head: ... Code:
```
dataset = pd.read_csv("data/AS/test_v2.csv")
sc = ...
```
- Modified 20 Mar at 06:43
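The excerpt cuts off before the actual error, but type inference over mixed or NaN-heavy pandas columns is a common culprit; a hedged sketch that sidesteps it with an explicit schema (the columns are made up, standing in for the CSV):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Stand-in for pd.read_csv("data/AS/test_v2.csv")
pdf = pd.DataFrame({"name": ["a", None], "score": [1.5, float("nan")]})

schema = StructType([
    StructField("name", StringType(), True),
    StructField("score", DoubleType(), True),
])
sdf = spark.createDataFrame(pdf, schema=schema)
sdf.show()
```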
dataframe: how to groupBy/count then filter on count in Scala
Spark 1.4.1: I encounter a situation where grouping a dataframe, then counting and filtering on the 'count' column, raises the exception ...
- Modified 20 Aug at 13:46
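The question is about Scala, where the clash is between the generated `count` column and the `count()` function; the same pattern in PySpark for illustration, referring to the column explicitly:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("b",)], ["k"])

counts = df.groupBy("k").count()            # adds a column literally named "count"
counts.filter(F.col("count") >= 2).show()   # in Scala: .filter($"count" >= 2)
```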
Iterate rows and columns in Spark dataframe
I have the following Spark dataframe that is created dynamically:
```
val sf1 = StructField("name", StringType, nullable = true)
val sf2 = StructField("sect...
```
- Modified 15 Sep at 10:12
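The snippet in the question is Scala; a PySpark sketch of the same idea, with the usual caveat that iterating means pulling rows back to the driver:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# collect() for small data; toLocalIterator() streams one partition at a time
for row in df.toLocalIterator():
    for col_name in df.columns:
        print(col_name, row[col_name])
```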
Provide schema while reading csv file as a dataframe in Scala Spark
I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be since I know my csv file. Also I a...
- Modified 16 Aug at 16:17
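Passing a `StructType` to the reader skips inference; the question is about Scala, but the shape is the same in PySpark (file path and columns below are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = (spark.read
      .option("header", "true")
      .schema(schema)           # enforce the declared types instead of inferring
      .csv("/tmp/people.csv"))
```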
Removing duplicate columns after a DF join in Spark
When you join two DFs with similar column names, the join works fine, but you can't call the `id` column because it is ambiguous, and you would get the fo...
- Modified 25 Dec at 16:33
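Joining on the column name (rather than an expression) keeps a single copy of the key; with an expression join, one side's copy can be dropped explicitly. A sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "x")], ["id", "a"])
df2 = spark.createDataFrame([(1, "y")], ["id", "b"])

joined = df1.join(df2, on="id")         # only one `id` column in the result
joined.select("id", "a", "b").show()

dropped = df1.join(df2, df1["id"] == df2["id"]).drop(df2["id"])   # expression join
dropped.show()
```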
How to select the first row of each group?
I have a DataFrame generated as follows. The results look like:
```
+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   0|   ca...
```
- Modified 7 Jan at 15:39
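The standard answer is a window with `row_number` partitioned by the group; a sketch using the columns visible in the excerpt, with toy values:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, "cat26", 30.9), (0, "cat13", 22.1), (1, "cat67", 28.5)],
    ["Hour", "Category", "TotalValue"])

w = Window.partitionBy("Hour").orderBy(F.col("TotalValue").desc())
(df.withColumn("rn", F.row_number().over(w))
   .filter("rn = 1")            # top row per Hour by TotalValue
   .drop("rn")
   .show())
```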
Best way to get the max value in a Spark dataframe column
I'm trying to figure out the best way to get the largest value in a Spark dataframe column. Consider the following example: ... which creates: ... My ...
- Modified 24 Sep at 08:7
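An aggregate keeps the work on the executors, and only the single result row comes back. A sketch with a made-up column `A`:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (5,), (3,)], ["A"])

max_a = df.agg(F.max("A")).collect()[0][0]   # single-row result, not a full collect
print(max_a)                                 # 5
```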
How to add a constant column in a Spark DataFrame?
I want to add a column in a `DataFrame` with some arbitrary value (that is the same for each row). I get an error when I use `withColumn` as follows:...
- Modified 7 Jan at 15:27
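The error usually comes from passing a plain Python value to `withColumn`; wrapping it in `lit()` fixes it. Sketch with made-up values:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["id"])

# df.withColumn("country", "DE") fails: the second argument must be a Column
df = df.withColumn("country", F.lit("DE")).withColumn("ratio", F.lit(0.5))
df.show()
```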
Spark Dataframe distinguish columns with duplicated name
As I know, in a Spark Dataframe multiple columns can have the same name, as shown in the dataframe snapshot below:
```
[Row(a=107831, f=S...
```
- Modified 5 Jan at 16:0
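Aliasing the two frames (or renaming before the join) makes the duplicated name addressable; a sketch with toy values instead of the feature vectors from the question:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(107831, 3)], ["a", "f"])
df2 = spark.createDataFrame([(107831, 4)], ["a", "f"])

out = (df1.alias("l").join(df2.alias("r"), F.col("l.a") == F.col("r.a"))
       .select(F.col("l.f").alias("f_left"), F.col("r.f").alias("f_right")))
out.show()
```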
how to filter out a null value from spark dataframe
I created a dataframe in Spark with the following schema:
```
root
 |-- user_id: long (nullable = false)
 |-- event_id: long (nullable = false)
 |-- in...
```
- Modified 15 Sep at 10:7
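`isNotNull` (or the SQL form `IS NOT NULL`) is the usual filter; the third column of the schema is cut off in the excerpt, so the sketch uses a made-up nullable column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10, None), (2, 20, 5)], ["user_id", "event_id", "score"])

df.filter(F.col("score").isNotNull()).show()   # drop rows where score is null
df.where("score IS NOT NULL").show()           # same filter in SQL syntax
```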
Concatenate two PySpark dataframes
I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them:
```
from pyspark.sql.functions import randn, rand
df_1 = sqlContext....
```
- Modified 25 Dec at 16:26
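`unionByName` with `allowMissingColumns=True` (Spark 3.1+) handles the columns that exist on only one side; for older versions the missing column can be added as nulls first. Sketch with toy frames:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df_1 = spark.createDataFrame([(1, 0.1)], ["id", "uniform"])
df_2 = spark.createDataFrame([(2, 0.3, 0.7)], ["id", "uniform", "normal"])

both = df_1.unionByName(df_2, allowMissingColumns=True)   # Spark 3.1+

# Pre-3.1: add the missing column explicitly, then union by name
both_old = df_1.withColumn("normal", F.lit(None).cast("double")).unionByName(df_2)
both.show()
```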