Loud and clear: Data Scientist Training for Librarians (DST4L) is a wonderful concept. It’s an experimental course developed at the Harvard-Smithsonian Center for Astrophysics Library with the aim of training librarians to respond to the growing data needs of their communities. We all know the story: the Internet happened and the amount of information and data exploded. And it’s right in front of everybody’s nose – the neighbor kid as well as the scholar. Information and data in any form are the building blocks of science and knowledge, and as they rapidly increase, so does the need for tools and skills to tame and analyze them. This development has changed the way academia works, and when academia changes, the library should pay attention and think about its options.
DST4L – Copenhagen 9th – 11th of September 2015
DST4L has been held three times in the States and was now set to take place for the first time in Europe, at the Library of the Technical University of Denmark just outside Copenhagen. 40 participants from all across Europe were ready to get their hands dirty over a three-day marathon of relevant tools for data archiving, handling, sharing and analysis. See the full program here and check the #DST4L hashtag on Twitter.
Day 1: OpenRefine
OpenRefine is an open source desktop application for cleaning up data and transforming it into other formats. It is similar to spreadsheet applications and can work with spreadsheet file formats. Unlike spreadsheets, no formulas are stored in the cells; instead, formulas are used to transform the data, and each transformation is done only once. Transformation expressions can be written in Google Refine Expression Language (GREL), Jython (i.e. Python) and Clojure.
How can it be used in libraries and academia?: When scraping, for instance, tweets from Twitter, you often end up with a dirty dataset with a lot of noise that you want to filter out before you start analyzing your data. OpenRefine can be used for this.
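To give a feel for what this kind of cleanup means in practice, here is a minimal sketch in plain Python rather than in OpenRefine itself. It is not how OpenRefine works internally; the function name `clean_tweets` and the sample tweets are made up for illustration. It collapses stray whitespace, drops empty rows and removes duplicates that differ only in letter case.

```python
import re


def clean_tweets(raw_tweets):
    """Normalize whitespace and drop blanks and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for text in raw_tweets:
        # Collapse runs of whitespace (including newlines) and trim both ends.
        normalized = re.sub(r"\s+", " ", text).strip()
        if not normalized:
            continue  # skip rows that are empty after trimming
        # Compare case-insensitively so "Election" and "ELECTION" count as one.
        key = normalized.casefold()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


# Hypothetical sample data, not real tweets.
sample = [
    "  Election night!  #dkpol ",
    "Election night! #dkpol",
    "",
    "ELECTION NIGHT! #dkpol",
    "Polls close\nat 20:00",
]
print(clean_tweets(sample))
```

In OpenRefine you would do roughly the same thing through its own trimming and clustering features, with the advantage that you can see and undo every step.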
At the end of day 1 we threw a social event with drinks ‘n data at our Digital Social Science Lab (DSSL) at The Faculty Library of Social Sciences, Copenhagen University Library. DSSL project head Michael Svendsen and I gave a quick talk on the concept of DSSL, and then there were bubbles and lots of talk before we hit a restaurant for something to eat. A great ending to a great day.
Day 2: GitHub
GitHub is a web-based collaborative platform for code management and code review for open source and private projects. Public projects are free, and private ones come with a fee. According to GitHub, it has 9 million users and over 21 million repositories, which makes it the largest host of source code in the world.
How can it be used in libraries and academia?: For one thing, GitHub can function as a strong collaborative platform within various academic disciplines, and libraries have an opportunity to support this with GitHub know-how. For libraries themselves, GitHub can be used for developing and sharing good stuff independent of time and place. For instance: How many LibGuides on e.g. sociology around the world have been built from scratch? A lot. If libraries used GitHub to share and develop just one prototype LibGuide for sociology, it could serve as a strong starting point for all LibGuides on sociology around the world.
Day 3: Python
Python is a programming language whose syntax allows programmers to express concepts in fewer lines of code than would be possible in languages like Java or C++. Python interpreters are available for many operating systems, so Python code can run on a wide variety of systems. Using third-party tools such as Py2exe or Pyinstaller, Python code can be packaged into stand-alone executable programs for some of the most popular operating systems, which makes it possible to distribute Python-based software without requiring the installation of a Python interpreter.
How can it be used in libraries and academia?: Being a programming language, Python can be used for many things, but one use that stands out for me is web scraping. The web contains huge amounts of data that are relevant for researchers and students. Let’s say you want to scrape tweets on the Danish election and analyze them on various parameters. Python can help you do that, and once you have your dataset you can clean it up in OpenRefine before analyzing it in, for instance, NVivo.
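As a rough sketch of what web scraping looks like in Python, the example below pulls all links out of a piece of HTML and writes them as CSV, which OpenRefine can then import. It only uses the standard library. The names `LinkCollector` and `links_to_csv` and the sample HTML are my own inventions for illustration; fetching a live page (or tweets, which in practice usually means going through an API) is deliberately left out.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_to_csv(html):
    """Return the links found in `html` as CSV text with a header row."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "text"])
    writer.writerows(parser.links)
    return buffer.getvalue()


# Hypothetical sample page, standing in for HTML downloaded from the web.
sample_html = """
<ul>
  <li><a href="https://example.org/a">First post</a></li>
  <li><a href="https://example.org/b"> Second <b>post</b> </a></li>
</ul>
"""
print(links_to_csv(sample_html))
```

Save the output to a file and you have a small dataset ready to be opened and cleaned up in OpenRefine.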
Here are some good blog posts on how to get started with web scraping in Python:
Ending notes and the future of DST4L
DST4L is important because it clearly addresses some of the key skills librarians within academia will need to gain to continue to create value for their institutions – and not only on a strategic level: you actually learn how to use stuff like OpenRefine, GitHub and Python. That said, the learning curve is pretty steep, at least it was for me, and I’m no master of the things I learned. But what is important is that I now have a basic understanding of what we are capable of doing with these tools, and we are standing on a very solid platform for building a service for our university and faculties on these matters.
DST4L was brought to Copenhagen by a couple of great data enthusiasts (hands up for Chris Erdmann, Ivo Grigorov and Mikael Elbæk), and I’m thankful that people like them put effort and time into a concept that brings so much value to the table. But the question that stands after 3 days of Data Scientist Training is: who will put in the time and energy to make sure the next DST4L happens? DST4L is important for the future of libraries, but to survive I guess it has to be lifted out of the hands of awesome enthusiasts and into a sustainable organizational structure that provides the world of librarianship with great data scientist training. Maybe a task for OCLC or another major worldwide library player.
For now: thanks for some great and valuable DST4L days in Copenhagen. Hope that there will be many many DST4L sessions in the future.