In the first two parts of this series (https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/ and https://the.agilesql.club/2019/08/how-do-we-prove-our-etl-processes-are-correct-how-do-we-make-sure-upstream-changes-dont-break-our-processes-and-break-our-beautiful-data/), I talked about how to unit test your business logic and integration test your ETL infrastructure code. Having these tests ensures that your code is in order: it means you have documented and future-proofed your code, which is a fantastic thing to have. What testing our code doesn't give us is a way to validate that the data we receive is correct.
I finally got around to updating the tSQLt test adapter for Visual Studio. You can download it from https://marketplace.visualstudio.com/items?itemName=vs-publisher-263684.GoEddietSQLt2019, or the search in the Visual Studio extensions thingy finds it as well. For details on what this is and how it works, see the original post: https://the.agilesql.club/2016/08/tsqlt-visual-studio-test-adapter/
Steps needed

Getting Apache Spark running on Windows involves:

- Installing a JRE 8 (Java 1.8/OpenJDK 8)
- Downloading and extracting Spark and setting SPARK_HOME
- Downloading winutils.exe and setting HADOOP_HOME
- If using the dotnet driver, also downloading the Microsoft.Spark.Worker and setting DOTNET_WORKER_DIR if you are going to use UDFs
- Making sure java and %SPARK_HOME%\bin are on your path

There are some pretty common mistakes people make (myself included!). The most common I have seen recently are having a semi-colon in JAVA_HOME/SPARK_HOME/HADOOP_HOME, or having HADOOP_HOME not point to a directory with a bin folder which contains winutils.
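The common mistakes above can be checked mechanically before launching Spark. The following is a minimal Python sketch, not part of the original post: the function name `check_spark_env` and the exact wording of the messages are hypothetical, and it only covers the two mistakes mentioned here (a semi-colon in one of the home variables, and HADOOP_HOME not containing bin\winutils.exe).

```python
import os
from typing import Callable, List, Mapping

# Variables the steps above say must be set.
REQUIRED_VARS = ("SPARK_HOME", "HADOOP_HOME")
# Variables checked for a stray semi-colon (JAVA_HOME is only checked if present).
CHECKED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def check_spark_env(
    env: Mapping[str, str],
    is_file: Callable[[str], bool] = os.path.isfile,
) -> List[str]:
    """Return a list of human-readable problems found in the environment."""
    problems = []
    for name in CHECKED_VARS:
        value = env.get(name, "")
        if not value:
            if name in REQUIRED_VARS:
                problems.append(f"{name} is not set")
        elif ";" in value:
            problems.append(f"{name} contains a semi-colon: {value!r}")

    # HADOOP_HOME must point at a directory whose bin folder holds winutils.exe.
    hadoop_home = env.get("HADOOP_HOME", "")
    if hadoop_home and ";" not in hadoop_home:
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not is_file(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems


if __name__ == "__main__":
    # Check the real environment of the current process.
    for problem in check_spark_env(os.environ):
        print(problem)
```

The environment and the file check are passed in as parameters so the function can be exercised without touching the real machine; run as a script, it checks `os.environ`.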
When you run an application using spark-dotnet, you launch it with spark-submit, which starts a Java virtual machine, which starts the spark-dotnet driver, which then runs your program. That leaves us with a problem: how do we write our programs in Visual Studio and press F5 to debug? There are two approaches, one I have used for years with dotnet when I want to debug something that is challenging to get a debugger attached to - think apps which spawn other processes and fail in the startup routine.
ETL Testing Part 2 - Operational Data Testing

This is the second part of a series on ETL testing. The first part covered unit testing, and in this part we will talk about how we can prove the correctness of the actual data, both today and in the future, after every ETL run. Testing ETL processes is a multi-layered beast: we need to understand the different types of test, what they do for us, and how to actually implement them.
I found this question on Stack Overflow that went something like this: “I have a file that includes line endings in the wrong place and I need to parse the text manually into rows” (https://stackoverflow.com/questions/57294619/read-a-textfile-of-fixed-length-with-newline-as-one-of-attribute-value-into-a-ja/57317527). I thought it would be interesting to implement this with what we have available today in spark-dotnet. The thing is, though, that even though this is possible in spark-dotnet or the other versions of Spark, I would pre-process the file in something else so that, by the time Spark reads it, it is already in a suitable format.
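As an illustration of the pre-processing route mentioned above, here is a minimal Python sketch. It is not the spark-dotnet implementation from the post, and it rests on loud assumptions that the excerpt does not confirm: every record is exactly `record_length` characters long, each record is followed by a single terminator, and the last record may omit it. The function name `split_fixed_length` is hypothetical.

```python
from typing import List


def split_fixed_length(text: str, record_length: int, terminator: str = "\n") -> List[str]:
    """Split text into fixed-length records, ignoring line endings inside a record.

    Assumes each record is exactly record_length characters and is followed
    by terminator (the final record may have no terminator).
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")

    stride = record_length + len(terminator)
    records = []
    for start in range(0, len(text), stride):
        record = text[start:start + record_length]
        if len(record) < record_length:
            raise ValueError(f"truncated record at offset {start}")
        # Whatever follows the record must be the terminator, or nothing at end of input.
        tail = text[start + record_length:start + stride]
        if tail not in ("", terminator):
            raise ValueError(f"expected terminator after record at offset {start}")
        records.append(record)
    return records


if __name__ == "__main__":
    # The first record contains a newline as part of its data.
    sample = "AB\nCD\nEFGHI\nJKLMN"
    for row in split_fixed_length(sample, record_length=5):
        print(repr(row))
```

Because rows are cut by position rather than by line ending, a newline inside a record is kept as data. When reading a real file in Python, opening it with `newline=""` stops the line endings being translated before the offsets are computed.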
There was a breaking change with version 0.4.0 that changed the name of the class that is used to load the dotnet driver in Apache Spark. To fix the issue you need to use the new package name, which adds an extra dotnet near the end. Change:

    spark-submit --class org.apache.spark.deploy.DotnetRunner

into:

    spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner

What if I have this error but that doesn't fix it?

When you run a spark app using spark-submit and you get a ClassNotFoundException for the driver then it boils down to either making a typo or something on your system blocking the jar from being loaded (anti-virus?
Why do we bother testing?

Testing isn’t an easy thing to define. We all know we should do it, and when something goes wrong in production people shout and ask where the tests were. Hell, even auditors like to see evidence of tests (whether or not they are good isn't generally part of an audit). What do we test, how, and why do we even write tests? It is all well and good saying “write unit tests and integration tests” but what do we test?
How do you read and write CSV files using the dotnet driver for Apache Spark? I have a runnable example here: https://github.com/GoEddie/dotnet-spark-examples

Specifically: https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv

Let's take a walkthrough of the demo:

    Console.WriteLine("Hello Spark!");

    var spark = SparkSession
        .Builder()
        .GetOrCreate();

We start with the obligatory “Hello World!”, then we create a new SparkSession.

    //Read a single CSV file
    var source = spark
        .Read()
        .Option("header", true)
        .Option("inferSchema", true)
        .Option("ignoreLeadingWhiteSpace", true)
        .Option("ignoreTrailingWhiteSpace", true)
        .
Apache Spark is written in Scala, which compiles to Java bytecode and runs inside a Java virtual machine. The spark-dotnet driver runs dotnet code and calls Spark functionality, so how does that work? There are two paths to run dotnet code with Spark: the first is the general case, which I will describe here; the second is UDFs, which I will explain in a later post as it is slightly more involved.