Dutch degrees are notorious for their attrition rate, and a lot of people switch degrees. I’ve switched twice myself, going from CS to AI and then to Software Engineering. I figured it could be interesting to look at what everyone was doing now. Most people generally find their way to the finish line, and there is also a substantial group that starts working and doesn’t finish their degree. Interestingly, the Hanze University of Applied Sciences had the same idea and scraped the LinkedIn profiles of all their alumni - presumably me included. They found that the top 3 roles for software engineers were:
This is part 1 in a series on this project; more posts will be written as the project progresses.
I’m a bit of a lousy birder. While I’ve always been interested in birds, I’m almost completely unable to identify them by their calls. I never started memorizing them, and my ability to recognize them on the basis of visual cues is poor. It was only when I started kayaking (about five years ago) that I was exposed to more bird-watching, and during a trip to Scotland last year a couple of friends took me on my first bird-watching trip.
So, given that my human-based detection is obviously lacking, I figured: with the plethora of bird data online, wouldn’t it be possible to scrape it and build a bird classifier using CNNs and general Python shenanigans?
Whoooo! I passed the GCP Data Engineer exam. In truth, I am not really a data engineer... but it is a role I can fill, and I strongly feel that as a data scientist you should have a solid grasp of engineering too, or at least understand it well enough to know what the challenges are. The exam was about 2 hours long and is supposed to be roughly equivalent to the AWS professional-level certifications. Funnily enough, it actually covers a lot of ML engineering topics. Perhaps there’s an ML exam in the works?
I’ve just added three blog posts I wrote during the Big Data bachelor course given at Radboud University. As a master’s student I’m allowed to take one or two bachelor courses if there’s a good reason... and because no other course really goes into Spark, Hadoop and Scala, I figured it would be a nice addition to the Python-heavy curriculum. Not that I dislike Python, of course.
There are three posts in total:
Hadoop and the HDFS - an introduction to Hadoop and HDFS.
Spark - a look at a Kaggle competition data set in Spark.
The class project: A solo project about submitting code to a national research cluster and running queries against 1.73 billion web pages.
You can find the posts here: Big Data Series
This post will have a slightly different angle than the previous posts in the Big Data Course series. The goal is to detail my progress on a self-chosen, free-format project that uses the SURFsara Hadoop cluster - not to solve a particular problem, but to give an overview of the problems I encountered and the little things I came up with. I intend to post this both on the mini-site for the course and on my personal blog; my apologies if my tone is a bit bland as a result. Here we go!
SURFsara is a Dutch institute that provides web and data services to universities and schools. Students may know SURF from the cheap software or the internet access it provides to high schools. SARA, though, is the high-performance computing department, and it was the national academic computing center before merging into SURF. They do a lot of cool things with big data, which over time has come to include a Hadoop cluster named Hathi.
The Hathi cluster hosts a February 2016 collection from the Common Crawl. The Common Crawl is a collection of crawled web pages that comes pretty close to covering the entirety of the web. The full data set is in the petabyte range; we only have access to a single snapshot, which still takes up a good number of terabytes and contains 1.73 billion URLs. You don’t want to download this on your mobile phone’s data cap.
The Common Crawl data is stored in WARC (Web ARChive) files, an open format.
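A WARC record is just a plain-text header block followed by the captured payload, so you can peek at one without special tooling. Below is a minimal sketch, using only the Python standard library, of splitting a single record into its parts; the sample record is made up for illustration, and real pipelines would use a dedicated library such as warcio:

```python
def parse_warc_record(record: bytes):
    """Split one WARC record into version line, header fields, and payload.

    A record starts with a version line ("WARC/1.0"), followed by
    "Name: value" header lines, a blank line, and then the payload.
    """
    head, _, payload = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return version, headers, payload


# A tiny hand-made record, purely for illustration
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, crawl!"
)

version, headers, payload = parse_warc_record(sample)
print(version, headers["WARC-Target-URI"], payload.decode())
```

Real WARC files concatenate (gzipped) records like this one back to back, which is what makes them easy to split and process in parallel on a cluster.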
So with all this data, there should be a lot of things to do!
So the third assignment is running Spark and playing around with it. The first part was basically messing about with query processing; the second part was playing with data and DataFrames. As these do not actually seem to be required for the post, I have left them out completely.
The way I understand this is that I’m supposed to play with Spark, come up with something new, and write a short blog post detailing my experiences.
Alternatively, you could decide to carry out a small analysis of a different open dataset, of your own choice; and present to your readers some basic properties of that data. You will notice that it is harder than following instructions, and you run the risk of getting stuck because the data does not parse well, etc.
So without further ado, let’s explore some datas.
Kaggle is one of the top resources for data science competitions, where data scientists, analysts, and programmers of all flavors unite and compete for prizes. While IMO the prize money mostly seems to go to people who already have top-tier knowledge (like people who work at DeepMind) or simply a lot of time and resources behind them (I recall reading that some people spend 5 hours a day on a competition, which would probably make the pay-off very poor for their time investment), it’s kind of a data geek playground. I have selected the Sberbank competition, which launched recently.
The first step is simply accepting the conditions of the competition and downloading a zip file with the training and test data. Additional data about Russia’s economy is available in separate files, and the file description hints that these may be joined together with the proper instructions. All the data is in comma-separated values; the training data is 44 MB. Because one can only download the data once authenticated and I’m using a virtual machine, I’ve hosted the dataset elsewhere before pulling it in with wget, as this seems an easier solution than setting up a shared partition.
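Before throwing the file at Spark, a quick sanity check of the CSV never hurts. The snippet below is a generic sketch using only the standard library; the filename and column names in the usage example are whatever the competition zip actually contains, so treat them as placeholders:

```python
import csv

def peek_csv(path, n=5):
    """Return the header row and the first n data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows

# Hypothetical usage once the data is downloaded:
# header, rows = peek_csv("train.csv")
# print(header)
```

This catches the boring failure modes (wrong delimiter, byte-order marks, ragged rows) before they turn into cryptic errors inside a Spark job.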
Last year, the Dutch Rijkswaterstaat (part of the Dutch Ministry of Infrastructure and the Environment) released a website where you can track in real time where salt-scattering trucks - also known as gritters - are moving. This is particularly useful, as Dutch infrastructure always seems to shut down completely during the first days of mild snow, and you need to know whether there’s a chance you might make it to work today.
The Rijkswaterstaat website features a Google Maps widget that shows which trucks are active and moving and which are not.
The website is backed by an API which, while not publicly advertised, can be found by opening the dev tools in any modern browser and looking at the requests made by the page. The endpoint can then be queried directly.
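Querying such an unadvertised endpoint needs nothing beyond the standard library. The sketch below is illustrative only: the URL is a placeholder for whatever dev tools reveals, and the `active` field name is a guess rather than the real response schema:

```python
import json
from urllib.request import Request, urlopen

# Placeholder - substitute the endpoint found via the browser's dev tools
API_URL = "https://example.invalid/gritters.json"

def fetch_json(url, timeout=10):
    """GET a URL and decode the response body as JSON."""
    req = Request(url, headers={"User-Agent": "gritter-watcher/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def active_trucks(payload):
    """Filter entries flagged as active ('active' is a hypothetical field)."""
    return [truck for truck in payload if truck.get("active")]

if __name__ == "__main__":
    trucks = fetch_json(API_URL)
    print(f"{len(active_trucks(trucks))} of {len(trucks)} gritters active")
```

Polling an endpoint like this on a timer is usually enough for a "can I get to work" check; anything fancier would want caching and a polite request rate.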
Snooping around I found a nice stream of JSON data:
I just got Hexo up and running, and it seems to work smoothly. I was pretty excited to find a blog engine based on Node.js, and coming from the WordPress world it was amusing to see that people actually ported WordPress themes to Hexo. Normally I would have preferred a much more minimalistic skin, but as I hope to post about outdoors stuff too, I figured I could use some color in my theme.