ADF: Querying JSON documents

In my previous blog post I talked about how to read from an XML Webervice and use xpath to query the XML on the expressions side of things. You can read the XML article here ( Now, what if we don't have XML but have JSON? Well well indeed, what if there was a way to query JSON documents using a query, imagine if you will a JSONQuery where you can pass a similar query to an xpath query to retrieve specific values from the JSON document.

ADF, XML objects and XPath in the expression language

When you use ADF, there are two sides to the coin. The first is the data itself that ADF does very well, from moving it from one site to another to flattening JSON documents and converting from CSV to Avro, to Parquet, to SQL is powerful. The other side of the coin is how ADF uses data as variables to manage the pipeline, and it is this side of the coin that I wish to talk about today.

Synapse Analytics and .NET for Apache Spark Example 4 - JOINS

This is a bit of a longer one, a look at how to do all the different joins and the exciting thing for MSSQL developers is that we get a couple of extra joins (semi and anti semi oooooooh). T-SQL SELECT * FROM chicago.safety_data one INNER JOIN chicago.safety_data two ON one.Address = two.Address; Spark SQL SELECT * FROM chicago.safety_data one INNER JOIN chicago.safety_data two ON one.Address = two.Address DataFrame API (C#) var dataFrame = spark.

Synapse Analytics and .NET for Apache Spark Example 3 - CTE()

The next example is how to do a CTE (Common Table Expression). When creating the CTE I will also rename one of the columns from “dataType” to “x”. T-SQL WITH CTE(x, dataType, dataSubType) AS ( SELECT dateTime, dataType, dataSubType FROM chicago.safety_data ) SELECT * FROM CTE; Spark SQL WITH CTE AS (SELECT dateTime as x, dataType, dataSubType FROM chicago.safety_data) SELECT * FROM CTE DataFrame API (C#) The DataFrame example is a bit odd - by creating a data frame with the first query we have the CTE that we can use:

Synapse Analytics and .NET for Apache Spark Example 2 - ROW_NUMBER()

The next example is how to do a ROW_NUMBER(), my favourite window function. T-SQL SELECT *, ROW_NUMBER() OVER(ORDER BY dateTime) as RowNumber FROM chicago.safety_data Spark SQL SELECT *, ROW_NUMBER() OVER(ORDER BY dateTime) as RowNumber FROM chicago.safety_data DataFrame API (C#) var dataFrame = spark.Read().Table("chicago.safety_data"); var window = Microsoft.Spark.Sql.Expressions.Window.OrderBy("dateTime"); dataFrame = dataFrame.WithColumn("RowNumber", Functions.RowNumber().Over(window)); dataFrame.Show(); To see this in action, please feel free to deploy this repo to your Synapse Analytics repo:

Synapse Analytics and .NET for Apache Spark Example 1 - Group By

I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples of how to do something in T-SQL, then translated into how to do that same thing in Spark SQL and the Spark DataFrame API in C#.

How to get data in a DataFrame via .NET for Apache Spark

When I first started working with Apache Spark, one of the things I struggled with was that I would have some variable or data in my code that I wanted to work on with Apache Spark. To get the data in a state that Apache Spark can process it involves putting the data into a DataFrame. How do you take some data and get it into a DataFrame? This post will cover all the ways to get data into a DataFrame in .

Git in 10 commands to do 99% of gitting and keeping out of trouble

Git is hard, probably harder than it needs to be but I have been using it for about 5 years and have a workflow that works for me. I use git from a command line and have learnt how to use these ten commands. If I need to deal with a merge conflict, I use VS Code and the gitlens extension. If I need to do anything else then I probably copy out the files I want to keep, reset my local repo or delete it and clone a new repo then paste back the files I want to include in the change.

TF-IDF in .NET for Apache Spark Using Spark ML v2

Spark ML in .NET for Apache Spark Apache Spark has had a machine learning API for quite some time and this has been partially implemented in .NET for Apache Spark. In this post we will look at how we can use the Apache Spark ML API from .NET. This is the second version of this post, the first version was written before version 1 of .NET for Apache Spark and there was a vital piece of the implementation missing which meant although we could build the model in .

Blog .NET for Apache Spark UDFs Missing Shared State

The Problem When you use a UDF in .NET for Apache Spark, something like this code: class Program { static void Main(string[] args) { var spark = SparkSession.Builder().GetOrCreate(); _logging.AppendLine("Starting Select"); var udf = Functions.Udf<int, string>(theUdf); spark.Range(100).Select(udf(Functions.Col("id"))).Show(); _logging.AppendLine("Ending Select"); Console.WriteLine(_logging.ToString()); } private static readonly StringBuilder _logging = new StringBuilder(); private static string theUdf(int val) { _logging.AppendLine($"udf passed: {val}"); return $"udf passed {val}"; } } Generally, knowing .NET we would expect the following output: