Kaggle, a platform best known for its data analytics competitions, recently released a report compiling information on data analytics professionals from around the world. Who are these people? How old are they, what education do they have, what tools do they use every day, and what problems do they face most often? Most importantly: how much do they earn? How does the situation in Poland compare with the rest of the world, and what skills should a newcomer to the industry invest in? The answers are in the text below.
Kaggle is known mainly for hosting Data Science competitions, but it also shares many interesting datasets and brings together a community of analysts. Its recent report describes who the people involved in data analysis and machine learning, in the broadest sense, actually are. Kaggle sent questionnaires to people working in various industries in many countries, asking about their education, age, the tools they use, their salaries, and the most common problems in their daily work. It collected more than 16,000 responses this way. The survey had one very important caveat: a Data Scientist was defined as "a person who writes code to analyze data," so the respondents do not perform their analysis in visual tools where the dependencies of interest can be "clicked together," but by writing code.
Besides shedding light on this sector of the labor market in general, the results can also answer a very important question for everyone who is just getting into the profession or wants to change careers: how should a newcomer enter the field? The report on the Kaggle website focuses on results from around the world, while we will devote most of our attention to the Polish market.
Data Scientists in Poland
Let's start with the basics: the age and gender of the respondents in Poland. The median age for all respondents was 29, meaning that half were younger and half were older than 29. Broken down by gender, the median was 29 for men and 27 for women. The men's median equals the overall one because men dominated the sample: 159 of the Polish respondents were men and only 20 were women. The imbalance holds globally as well: of the 16,385 surveys in total, 13,427 came from men and only 2,714 from women. Polish respondents were between 18 and 60 years old, with the vast majority falling in the 24-40 range. The worldwide results show a fairly similar pattern.
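To see why the men's median can coincide with the overall one, here is a minimal sketch using Python's standard `statistics` module. The ages below are invented for illustration and are not the survey data; only the mechanism (a much larger group dominating the overall median) mirrors the text.

```python
from statistics import median

# Hypothetical ages, invented for illustration; NOT the Kaggle survey data.
ages_by_gender = {
    "men": [24, 27, 29, 29, 29, 29, 31, 35, 40],
    "women": [25, 27, 29],
}

# Median within each group.
medians = {gender: median(ages) for gender, ages in ages_by_gender.items()}

# Median over everyone: the larger group pulls it toward its own median.
all_ages = [age for ages in ages_by_gender.values() for age in ages]
overall_median = median(all_ages)

print(medians, overall_median)
```

With a group that is several times larger, the overall median ends up at (or very near) that group's median, which is what the Polish figures suggest.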
The survey also asked about the type of employment. In Poland, 69% of respondents are employed full-time, followed by freelancers and part-time employees. That differs a bit from the worldwide results, where second place goes to the answer "not employed, but looking for a job." Could it be that in Poland everyone who wants to work in the industry finds a job? Or are there simply so few people on the job market with the right skills? Finally, it's worth looking at the data on women in Poland: the majority work full-time (though a smaller share than among men), and the next three answers tie at 15% each: part-time employment, freelancing, and not employed and not currently looking for work. One possible explanation is that women raising children are unable or unwilling to take on full-time work, especially given the age of the respondents.
What are the salaries of Data Scientists?
Let's move on to what excites people most about such surveys: earnings. In Poland, the median annual salary is $23,894 (unfortunately, the report doesn't say whether this is net or gross), which translates to about 85 thousand zlotys a year, or roughly 7 thousand zlotys a month. Pretty good for Polish conditions. According to the survey, most Polish data analysts earn less than about twice that amount, $50,000 a year (just 7 people earn more). How do these figures compare with other countries? In the U.S., the median is as high as $110,000, more than twice that $50,000 level, and salaries stretch to around $200,000. If we'd rather not look overseas, let's compare ourselves with two of our neighbors. In Germany, the median is $71,750, with top salaries reaching $150,000. That our western neighbors earn more probably surprises no one. What may come as a surprise is that in Ukraine the median is... $44,300, with the better-off earning around $75,000. This is very similar to Spain, for example. It's worth noting, however, that only 21 Ukrainians chose to report their earnings, so the data may not be representative.
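The conversion above can be retraced with a few lines of Python. This is only a sketch: the exchange rate is not stated in the report, so it is derived here from the article's own rounded figures ($23,894 and about 85 thousand zlotys), and the function name `monthly_pln` is made up for the example.

```python
# Figures quoted in the article.
USD_ANNUAL_MEDIAN = 23_894
PLN_ANNUAL_MEDIAN = 85_000  # the article's rounded value

# Assumption: the exchange rate implied by the two figures above (PLN per USD).
implied_rate = PLN_ANNUAL_MEDIAN / USD_ANNUAL_MEDIAN


def monthly_pln(annual_usd: float, rate: float = implied_rate) -> float:
    """Convert an annual salary in USD to a monthly salary in PLN."""
    return annual_usd * rate / 12


print(f"implied rate: {implied_rate:.2f} PLN/USD")
print(f"median:  {monthly_pln(USD_ANNUAL_MEDIAN):,.0f} PLN per month")
print(f"$50,000: {monthly_pln(50_000):,.0f} PLN per month")
```

The same helper can be applied to the German or Ukrainian medians to put all figures on a monthly zloty scale.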
Now let's look at the respondents' education. There is no surprise here: the largest group, 57.5%, hold a master's degree. Next come bachelor's and engineer's degrees, with 21.3%, and in third place are PhDs, with 12.5%. Of course, people with a master's degree may hold it in a field other than mathematics or computer science, such as economics, and have retrained later, which happens quite often. In the worldwide data the first three places are the same, but the share of people with bachelor's degrees is higher: 32%, while master's degrees account for only 41.8%.
Most commonly used methods and tools
Once we know who the typical data analytics person is in Poland and around the world, we can move on to the more industry-specific information gathered in the survey. The first question in this area asked which method respondents use most often. It turns out that the simplest methods are the best, or at least the most popular: most respondents, regardless of the industry sector they represented, reported logistic regression. The only exception was military/security, where neural networks were used more often. Could it be about recognizing objects such as rocket launchers in satellite images? Or about recognizing the faces of wanted terrorists in footage from city cameras? After all, it is well known that neural networks work very well for image processing. Next in line were decision trees and the ever-popular random forests. Interestingly, ensemble methods, which combine several different models and rely on a fusion of their results, ranked only 6th. This is surprising, given that these are usually the methods that win prediction competitions.
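For readers who have not met the most popular method yet, below is a minimal, from-scratch sketch of logistic regression with one feature, trained by plain gradient descent. The data (hours of study versus passing an exam) is invented, and the function names `fit_logistic` and `predict_proba` are illustrative; in practice one would normally reach for a library implementation rather than this toy version.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors for the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_proba(x, w, b):
    """Estimated probability that the label is 1 for input x."""
    return sigmoid(w * x + b)


# Hypothetical data: hours studied -> passed the exam (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
print(f"P(pass | 0.5 h) = {predict_proba(0.5, w, b):.2f}")
print(f"P(pass | 4.5 h) = {predict_proba(4.5, w, b):.2f}")
```

The appeal of the method is visible even here: two parameters, a probability as output, and a decision boundary that is easy to explain.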
The next few pieces of information may interest both those who have been in the industry for a while and those who are just taking their first steps. When asked which tools (including programming languages) they use, data analysts put Python first by a wide margin, at 76%. This was a multiple-choice question, so the results don't add up to 100%, but that doesn't change the fact that 76% of respondents use Python, among other tools. The next most popular were R (a programming language competing with Python) and SQL. The remaining answers are far behind. This means that anyone who wants to acquire skills actually used on the job market should learn one of the two programming languages (Python will come in handy more often), as well as how to query databases.
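Why the shares in a multiple-choice question exceed 100% in total is easy to show with a small sketch. The four answers below are invented and only illustrate the counting; they are not taken from the survey.

```python
from collections import Counter

# Hypothetical answers: each respondent may select several tools.
responses = [
    {"Python", "SQL"},
    {"Python", "R"},
    {"R", "SQL"},
    {"Python"},
]

# Count how many respondents selected each tool.
counts = Counter(tool for answer in responses for tool in answer)

# Share of respondents per tool, in percent of all respondents.
shares = {tool: 100 * n / len(responses) for tool, n in counts.items()}

for tool, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{tool}: {share:.0f}%")
print(f"sum of shares: {sum(shares.values()):.0f}%")
```

Each percentage is computed against the number of respondents, not the number of ticks, so the shares are individually meaningful even though they do not sum to 100%.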
Speaking of database queries, the next question asked what type of data analysts work with. The vast majority of respondents answered that their data sits in relational databases, which confirms that knowledge of these will come in handy for anyone seeking employment in this sector. Second place went to text data (.csv and similar files), and images and video were far behind.
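The two most common data sources mentioned here, relational tables and .csv-style text, can be combined in a short sketch using only Python's standard library (`csv` and `sqlite3`). The table name, column names and rows are invented for the example, and an in-memory SQLite database stands in for whatever database a company actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be read from a file.
csv_text = "customer,city\nAnna,Warsaw\nJan,Krakow\nEwa,Warsaw\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Load the rows into an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [(row["customer"], row["city"]) for row in rows],
)

# A typical analytical query: number of customers per city.
per_city = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(per_city)
```

The `GROUP BY` query is the kind of everyday SQL the survey results suggest is worth learning alongside Python.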
Another interesting piece of information was how respondents share code. By far the most common answer was Git. No wonder: it is currently the most popular version control system. It lets us easily hand our code over to a colleague in the next department, roll back to the version from before the faulty change that broke the whole program, and maintain several versions at once when we "branch" our code to try different ways of solving a problem. As you can see, Git is a powerful tool that offers many possibilities and is widely used among data analysts. All the more reason to learn it. Interestingly, the second most common response was sharing code without the cloud, for example by email. Large companies in particular turned out to be willing to use this rather outdated method.
Main problems in the work of Data Scientists
An interesting question concerned the obstacles data analysts face. The main one turned out to be unclean data: data that contains information, but in a form that cannot be used directly. An example would be an address stored as a single text string that has not been split into postal code, city name and street name, so we can't tell how many customers we have in each city. Those who have been in the industry for a while know that cleaning data can sometimes take up to 80% of working time. Another extremely interesting answer was "Lack of data science talent," meaning that there is a shortage of people to work in this field. This is a further signal to those thinking about retraining that the market is waiting for new people to get into data analysis.
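The address example can be sketched in Python with a regular expression. The assumptions are loud ones: the addresses are invented, they are assumed to follow the pattern "street, NN-NNN city" (the Polish postal code format), and the helper `split_address` is hypothetical. Real data is rarely this regular, which is exactly why cleaning takes so long.

```python
import re
from collections import Counter

# Assumed layout: "<street>, <NN-NNN postal code> <city>".
ADDRESS_PATTERN = re.compile(
    r"^(?P<street>.+?),\s*(?P<postal_code>\d{2}-\d{3})\s+(?P<city>.+)$"
)


def split_address(raw: str):
    """Split a raw address into street, postal code and city, or return None."""
    match = ADDRESS_PATTERN.match(raw.strip())
    return match.groupdict() if match else None


# Hypothetical raw records, including one that cannot be parsed.
raw_addresses = [
    "ul. Dluga 5, 31-147 Krakow",
    "ul. Prosta 12, 00-850 Warszawa",
    "al. Jana Pawla II 27, 00-867 Warszawa",
    "Warszawa, somewhere near the centre",
]

parsed = [split_address(raw) for raw in raw_addresses]

# Count customers per city, skipping records that did not match the pattern.
customers_per_city = Counter(p["city"] for p in parsed if p is not None)
unparsed = sum(1 for p in parsed if p is None)

print(customers_per_city, "unparsed:", unparsed)
```

Keeping track of the records that fail to parse matters as much as the parsing itself, since those are the ones that need manual attention or a better rule.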
How to start your adventure in Data Science?
The report closes with a summary for those who would like to try their hand at the industry. When asked which language a newbie should learn, 63% answered "Python," while only 24% pointed to R. The sources beginners use to gain domain knowledge included Kaggle, online courses, YouTube videos, Stack Overflow and blogs. Those who have been employed for some time named similar sources, but in a different order. Experienced analysts are less likely to watch YouTube videos, turning more often to blogs, podcasts and Stack Overflow. Interestingly, it is the beginners who are more likely to use official documentation, even though it would seem that documentation is usually quite "indigestible" and mainly of use to experienced professionals.
Finally, there was another breakdown, this time of the ways in which beginners and those with experience in the industry search for jobs. The former mainly look on the websites of individual companies, or on classifieds sites, both those dedicated to the technical industry and general ones. Those with some experience, for their part, rely mainly on recruiters, friends and acquaintances, and only then on advertisements.
Results of the report in a nutshell
In conclusion, the report released by Kaggle provided a lot of interesting information. It turns out that those involved in Data Science tend to be quite young, mostly under 40, although some are as old as 60. These are people with higher education (from bachelor's degrees to PhDs, though not necessarily in this exact field), whose salaries in Poland range from about 7 to 14 thousand zlotys per month. And although these salaries are not impressive by world standards, they are still very attractive by Polish standards. Those who would like to try their hand at this industry should, first of all, master Python, since most of the people actually employed use it in their work, as well as relational databases and the SQL query language. Familiarity with the Git version control system may also come in handy. Respondents confirm that, besides the need for data cleaning, a major problem is the lack of talented people to do the job.
It's worth mentioning that the Data Science Bootcamp organized by Sages offers exactly what the market expects, as confirmed by the data collected by Kaggle: practical skills in Python, taught from scratch, the Git version control system, and the SQL query language for extracting data from a relational database for further analysis. Loading text data and cleaning the data obtained are of course covered as well. Once these basic skills are in place, the course moves on to statistics and to building various predictive models, including those that practitioners in the survey listed among the most commonly used.