Bytes & Bites - LIVE: Updates and info from the Datahack

11-17 @ MF building, room A311
Published

April 24, 2026

Theme: Live, info

Today we are hosting our first ever datahack!

There are currently two datasets available and both have some additional contextual information on the project and topic.

REMINDER

This data is NOT to be publicly shared without the agreement of the researcher! If the work does not lead to a collaboration, get in contact with us about making an example dataset that has the same shape as the data without giving the actual data away, so that you can still provide an example dataset with your work.
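To give an idea of what an example dataset with the same shape could look like, here is a minimal Python sketch. It is only an illustration, not the agreed process: the function names, the value ranges and the placeholder text are all assumptions, and it keeps the real column names, which may themselves need the researcher's approval. Please still check with us before sharing anything.

```python
import csv
import io
import random


def infer_kind(values):
    """Classify a column as 'int', 'float' or 'str' from its string values."""
    for kind, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return kind
        except ValueError:
            continue
    return "str"


def make_example_rows(real_csv_text, n_rows=5, seed=0):
    """Return the header plus random rows with the same columns and types.

    No real value is copied: numbers come from fixed, arbitrary ranges
    and text cells become neutral placeholders (both are assumptions).
    """
    rng = random.Random(seed)
    rows = list(csv.reader(io.StringIO(real_csv_text)))
    header, body = rows[0], rows[1:]
    kinds = [infer_kind([r[i] for r in body]) for i in range(len(header))]
    fake = []
    for _ in range(n_rows):
        row = []
        for kind in kinds:
            if kind == "int":
                row.append(str(rng.randint(0, 100)))
            elif kind == "float":
                row.append(f"{rng.uniform(0, 1):.3f}")
            else:
                row.append(f"text_{rng.randint(0, 9)}")
        fake.append(row)
    return [header] + fake
```

Because the values are drawn from fixed ranges rather than from the data, the output does not reveal minima, maxima or category labels of the original file.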

Next week (week of the 4th of May) we will follow up with everyone who joined, to see who has work to share and whether they are open to having it published here on the Bytes & Bites website.

Dataset 1 - Videos from the Tweede Kamer

This dataset is from Antonis Koutsoumpis from Management, Organisation and SBE. Download the data here or check out a shorter dataset here! The following is context from the researcher:

I would like to create a function that extracts jitter, shimmer, and harmonicity from audio files of continuous speech. Jitter, shimmer, and harmonicity should be extracted from sustained vowels rather than from continuous speech. To achieve the task with continuous speech, we need to identify the parts of the speech where participants sustained a vowel sound for some time (e.g. ‘uh’, ‘ee’, ‘aa’, etc. for at least 80 ms). In summary, the script should perform the following:

1) process an audio file and identify speech;
2) identify parts of the speech where participants sustained a vowel sound (e.g., sustained the vowel ‘aa’ for at least 80 milliseconds);
3) extract jitter, shimmer, and harmonicity (e.g., using open source voice analysis software such as OpenSmile, Praat, etc.) from those identified parts of the audio file;
4) average those values across the entire audio file per participant;
5) store the output in a csv file.

A similar procedure is described in this paper: Nathan, V., Rahman, M. M., Vatanparvar, K., Nemati, E., Blackstock, E., & Kuang, J. (2019, November). Extraction of voice parameters from continuous running speech for pulmonary disease monitoring. In 2019 IEEE international conference on bioinformatics and biomedicine (BIBM) (pp. 859-864). IEEE. link
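As a starting point for hackers, the bookkeeping around steps 2, 4 and 5 can be sketched in plain Python. This is a rough sketch under assumptions, not the researcher's method: it assumes some other tool (for example Praat or OpenSmile) has already produced a per-frame voiced/unvoiced flag and the glottal periods and peak amplitudes per segment. The voicing flag only stands in for a real vowel detector, and harmonicity is left out because it needs proper signal analysis. All function names and the 10 ms frame step are hypothetical.

```python
import csv
import io
from statistics import mean


def sustained_runs(voiced, hop_ms=10.0, min_ms=80.0):
    """Return (start, end) frame index pairs of voiced runs lasting >= min_ms."""
    runs, start = [], None
    for i, flag in enumerate(list(voiced) + [False]):  # sentinel closes last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if (i - start) * hop_ms >= min_ms:
                runs.append((start, i))
            start = None
    return runs


def local_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean.

    Applied to glottal periods this is local jitter; applied to
    per-period peak amplitudes it is local shimmer.
    """
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def participant_row(participant_id, segments):
    """Average jitter and shimmer over all sustained segments of one speaker."""
    jitter = mean([local_perturbation(s["periods"]) for s in segments])
    shimmer = mean([local_perturbation(s["amplitudes"]) for s in segments])
    return [participant_id, round(jitter, 4), round(shimmer, 4)]


def to_csv(rows):
    """Serialise the per-participant rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["participant", "jitter_local", "shimmer_local"])
    writer.writerows(rows)
    return buf.getvalue()


# Made-up numbers for one hypothetical participant, only to show the call shape.
segments = [{"periods": [9, 11, 9, 11], "amplitudes": [1.0, 1.0, 1.0]}]
print(to_csv([participant_row("p01", segments)]))
```

Keeping the segment finding and the averaging separate from the acoustic analysis means the extraction tool can be swapped without touching the rest.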

Dataset 2 - Crystal materials data

This dataset is from Senja Barthel et al. from the Maths department and is about materials science. The main topic of the data is crystalline structures. She gave this description of the project:

The idea of this research project was to use machine learning to investigate to what extent the performance (e.g. gas adsorption, for which the standard gases are nitrogen, carbon dioxide and methane, or heat capacity) of metal-organic frameworks is determined by the atomic composition of the materials (made in a Lego-style fashion using organic linkers that are attached to metal centers), and to what extent it is determined by the underlying crystallographic net.

Download the data here and check out the fuller documentation here!

Links for additional context:

On the variables

The coordination sequence counts how many vertices there are n steps away from any given node. Nodes are considered the same if the symmetry group maps them onto one another.
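To make the coordination sequence concrete, here is a small Python sketch that computes it by breadth-first search. It uses the plain square grid as a toy net, which is an assumption for illustration and not one of the nets in the dataset. In that grid all vertices are equivalent, so one sequence describes the whole net. The function name and `SQUARE_STEPS` are made up for this example.

```python
from collections import deque

# Toy net: the square grid, where every vertex has four neighbours.
SQUARE_STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def coordination_sequence(steps, n_shells):
    """Count vertices at graph distance 1..n_shells from the origin."""
    origin = (0, 0)
    dist = {origin: 0}
    queue = deque([origin])
    counts = [0] * (n_shells + 1)
    while queue:
        x, y = queue.popleft()
        d = dist[(x, y)]
        if d == n_shells:
            continue  # do not expand beyond the last requested shell
        for dx, dy in steps:
            nxt = (x + dx, y + dy)
            if nxt not in dist:
                dist[nxt] = d + 1
                counts[d + 1] += 1
                queue.append(nxt)
    return counts[1:]


print(coordination_sequence(SQUARE_STEPS, 4))  # [4, 8, 12, 16]
```

A real crystalline net usually has several symmetrically distinct vertices, so it would get one such sequence per vertex type.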

The vertex symbols encode the length of shortest cycles (with multiplicities if there are several of shortest length) of the graph (i.e. the crystalline net) for any pair of edges at any symmetrically distinct vertex.
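The shortest-cycle idea can also be sketched in Python, again on the square grid as a toy net (an assumption for illustration, not a net from the dataset). For every pair of edges at a vertex, the code finds the shortest paths between the two neighbours that avoid the vertex itself; closing such a path with the two edges gives a shortest cycle. This follows the shortest-cycle description given here; conventions based on rings can give different symbols. All names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Toy net: the square grid, where every vertex has four neighbours.
SQUARE_STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def shortest_cycles_through(a, b, steps, origin=(0, 0)):
    """Length and number of the shortest cycles using edges origin-a and origin-b."""
    dist = {a: 0}
    ways = {a: 1}  # number of shortest origin-avoiding paths from a
    queue = deque([a])
    while queue:
        v = queue.popleft()
        if b in dist and dist[v] >= dist[b]:
            break  # every shortest path into b has been counted
        for dx, dy in steps:
            w = (v[0] + dx, v[1] + dy)
            if w == origin:
                continue
            if w not in dist:
                dist[w] = dist[v] + 1
                ways[w] = ways[v]
                queue.append(w)
            elif dist[w] == dist[v] + 1:
                ways[w] += ways[v]
    return dist[b] + 2, ways[b]


def vertex_symbol(steps):
    """List (cycle length, multiplicity) for every pair of edges at the origin."""
    pairs = combinations(steps, 2)
    return sorted(shortest_cycles_through(a, b, steps) for a, b in pairs)


print(vertex_symbol(SQUARE_STEPS))
# [(4, 1), (4, 1), (4, 1), (4, 1), (6, 2), (6, 2)]
```

The four adjacent edge pairs each lie on one square, while the two pairs of opposite edges each lie on two cycles of length six.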

Plan for the dataset

The aim was to see to what extent the connectivity of the building blocks can predict the performance parameters, i.e. material properties, and how that differs from using the chemical composition of the linkers/nodes instead. Basically: what is due to the construction built from the Lego, and what is due to the colours of the Lego pieces that are used.

That was the plan. So I computed the chemical composition of the building blocks (the color of the legos) which is the ligand and center information. And I calculated descriptors for the crystallographic net (the construction blueprint).
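One way to set up this comparison is to train the same model twice, once per feature group, and compare the held-out scores. The Python sketch below shows that design on synthetic stand-in data. It is not the researcher's analysis: the data, the nearest-neighbour model and all names are assumptions, and the toy property is constructed to depend only on the net descriptor so that the difference is visible.

```python
import random
from statistics import mean


def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to query."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return mean([train_y[i] for i in order[:k]])


def loo_r2(features, target, k=3):
    """Leave-one-out R^2 of a k-nearest-neighbour regressor."""
    preds = []
    for i in range(len(features)):
        rest_x = features[:i] + features[i + 1:]
        rest_y = target[:i] + target[i + 1:]
        preds.append(knn_predict(rest_x, rest_y, features[i], k))
    avg = mean(target)
    ss_res = sum((t - p) ** 2 for t, p in zip(target, preds))
    ss_tot = sum((t - avg) ** 2 for t in target)
    return 1 - ss_res / ss_tot


# Synthetic stand-in data (NOT the datahack data): the "property"
# depends only on the net descriptor, never on the composition.
rng = random.Random(0)
net = [[rng.uniform(0, 1)] for _ in range(60)]   # the construction blueprint
chem = [[rng.uniform(0, 1)] for _ in range(60)]  # the colour of the Lego pieces
prop = [2.0 * row[0] for row in net]

r2_net = loo_r2(net, prop)
r2_chem = loo_r2(chem, prop)
print(f"net descriptors: R^2 = {r2_net:.2f}")
print(f"composition:     R^2 = {r2_chem:.2f}")
```

On the real data the two feature groups would be the ligand and center information on one side and the net descriptors on the other, and a third run with both groups together would show how much they overlap.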

Updates

More info will be added as it comes in.

Pizza

Pizza will be served as always! We expect the delivery around 13:00-13:30 :)