Finding the world's strongest algorithm: profiling 200,000 people through their mobile apps

Imagine walking into a cafe where the waiter already knows that you arrive every Wednesday morning at 8:15 and has your favorite macchiato ready in advance. It must be a great feeling.

This passage appears on the introduction page of the TalkingData global algorithm contest on Kaggle, a world-renowned predictive modeling and analytics platform.

Kaggle now has more than 600,000 registered users. They come from 194 countries and from all walks of life. Thanks to the rigor and openness of its competition system, Kaggle has also become the host platform for many important data science competitions. To date, more than 1,200 competitions have been held on Kaggle's crowdsourcing platform, most of them sponsored by industry, and they have produced many data science solutions.

Seen another way, the platform resembles the rankings of the martial arts world (the "rivers and lakes" of Chinese fiction), and some people use it to make their name.

In 2012, Merck, a US pharmaceutical company, launched a 60-day challenge on Kaggle to predict biological activity from a variety of data on 15 drugs, such as their intended targets and unintended (off-target) ones. A team of five from the University of Toronto in Canada took first place.

Introduced by name alone, the team might draw blank stares: who are they? But the person behind them is a heavyweight who will be no stranger to anyone working in data science. He is Geoffrey Hinton.

The standing of the three giants of deep learning, Yann LeCun, Yoshua Bengio and Geoffrey Hinton, needs no introduction. For curious bystanders, this is what he looks like.

All of this is to say that Kaggle is a very big deal. Now let's look at one of its competitions.

On the Kaggle platform, China's third-party mobile data platform TalkingData and the machine learning company Turi jointly organized a global algorithm contest. The contest began on July 11 and ended on September 5.

Lu Yao, a scientist in TalkingData's Data Science Department who was directly in charge of the contest, told Lei Feng Net (search for the “Lei Feng Net” public account) that an algorithm contest of a kind that seems commonplace in China had not been expected to make big news. However, the final count by account showed 1,689 participating teams, 1,961 players, and more than 24,000 submissions in total.

Lu Yao also tallied some interesting facts about the contest. To keep everyone in suspense a little longer, let me first explain what kind of competition this was.

Awesome! I feel like I could never love again

The challenge of this competition is to predict the gender and age group of device users from mobile device behavior data.

The contest provided desensitized data for about 200,000 users, divided into 12 groups, for example males aged 22 to 25, females aged 30 to 35, and so on. It also provided user behavior attributes such as time, geographic location, and phone brand and model, from which players guess which group each user belongs to.

Well, it looks a bit difficult.

For evaluation, players must output the probability that each user belongs to each group. A user can belong to only one group. In the ideal case, a prediction assigns probability 1 to the true group and 0 to all the others, and there is no loss. In practice, a submission spreads its probability across several groups, and this incurs a probability loss. The evaluation metric of the contest is this probability loss.
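To make the idea of a probability loss concrete, here is a minimal sketch, assuming the metric is the usual multi-class logarithmic loss (the average negative log of the probability assigned to each user's true group). The function name, the three toy groups and the clipping constant are illustrative choices, not taken from the contest.

```python
import math

def multiclass_log_loss(true_groups, predicted_probs, eps=1e-15):
    """Average negative log of the probability given to each user's true group."""
    total = 0.0
    for true_group, probs in zip(true_groups, predicted_probs):
        # Clip so that a probability of exactly 0 does not produce log(0).
        p = min(max(probs[true_group], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(true_groups)

# Three users and three hypothetical groups (the contest itself had 12).
true_groups = [0, 2, 1]
confident = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
hedged = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.3, 0.4, 0.3]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

perfect_loss = multiclass_log_loss(true_groups, confident)  # essentially zero
hedged_loss = multiclass_log_loss(true_groups, hedged)      # somewhere in between
uniform_loss = multiclass_log_loss(true_groups, uniform)    # equals log(3)
```

A prediction that is certain and right loses almost nothing, a uniform guess loses log(number of groups), and a confident wrong answer is punished hardest, which is why players spread their probabilities.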

Looks complicated, right? Well, it gets more complicated.

The data tables start with the age and gender groups. Each user is represented by an ID, and a user's behavior is recorded as a series of events. Each event records the latitude and longitude at which the ID appeared, which apps were installed, which apps were in use, and the brand and model of the phone.

Of course, the app IDs, as well as the latitude and longitude, have undergone rigorous and scientific desensitization.

However, because apps appear only as desensitized IDs, players have no idea what each app actually is. To help players interpret the data, the organizer attached labels to the apps, such as social networking or games, more than 1,000 labels in total.

Onlookers may ask: isn't this like fishing for a needle in the sea?

Interpreting the data is only the first step to success. The next step is feature engineering.

What features did players extract? For example, when is the user active? On rest days or workdays? During the day or at night?
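As a rough illustration of such time-based features, the sketch below turns a user's event timestamps into two shares: how much of the activity falls on weekends and how much at night. The timestamp format, the function names and the definition of "night" (22:00 to 06:00) are assumptions for the example, and "rest day" is simplified to Saturday and Sunday, ignoring public holidays.

```python
from datetime import datetime

def time_features(timestamp):
    """Turn one event timestamp into simple activity flags."""
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": t.hour,
        "is_weekend": t.weekday() >= 5,          # Monday is 0, so 5 and 6 are Sat/Sun
        "is_night": t.hour >= 22 or t.hour < 6,  # assumed definition of "night"
    }

def user_activity_profile(timestamps):
    """Aggregate the per-event flags into per-user shares."""
    feats = [time_features(ts) for ts in timestamps]
    n = len(feats)
    return {
        "weekend_share": sum(f["is_weekend"] for f in feats) / n,
        "night_share": sum(f["is_night"] for f in feats) / n,
    }

# Hypothetical events for one user: Sunday night, Monday morning,
# Saturday small hours, Wednesday afternoon.
events = [
    "2016-05-01 23:10:00",
    "2016-05-02 09:30:00",
    "2016-05-07 02:15:00",
    "2016-05-04 14:00:00",
]
profile = user_activity_profile(events)
```

Shares rather than raw counts keep heavy and light users comparable, which is one possible design choice among many.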

The data also includes overseas data, which raises time-zone issues. Users also have trajectories. How are a user's locations distributed? Are they clustered in one area, or gathered around several points? How far apart are those points? Does the user mostly appear on China's southeast coast or in the northwest? What are the characteristics of these locations?
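The trajectory questions can be turned into numbers in many ways. One simple sketch, under the assumption that the coordinates are still meaningful as latitude and longitude after desensitization, is to measure how far a user's points lie from their own center. The function names and sample coordinates are hypothetical, and averaging latitudes and longitudes is only a rough centroid.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def trajectory_features(points):
    """points: list of (latitude, longitude) pairs for one user."""
    n = len(points)
    center_lat = sum(lat for lat, _ in points) / n
    center_lon = sum(lon for _, lon in points) / n
    dists = [haversine_km(center_lat, center_lon, lat, lon) for lat, lon in points]
    return {
        "center": (center_lat, center_lon),
        "mean_radius_km": sum(dists) / n,  # small value: points gathered in one block
        "max_radius_km": max(dists),       # large value: at least one far-away point
    }

# Two hypothetical users: one who stays put, one seen in two distant cities.
homebody = trajectory_features([(31.23, 121.47)] * 3)
traveler = trajectory_features([(31.23, 121.47), (39.90, 116.40)])
```

The center answers "southeast coast or northwest?", while the two radii separate a user gathered in one block from one spread over several distant points.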

In addition, which of the installed apps have gone unused for a long time? What kind of information can this provide?

It is like One Hundred Thousand Whys. Undeniably, choosing feature values takes real know-how. Deciding whether a feature should take a binary 0/1 value or a more specific weight is skilled work.
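The difference between a 0/1 value and a weight can be sketched as follows. This is an illustrative encoding only: the app IDs are made up, and using each app's share of usage events as its weight is an assumption, one of many possible weightings.

```python
from collections import Counter

def app_features(installed, usage_events, weighted=False):
    """Map each installed app ID to a feature value.

    installed: iterable of app IDs on the device.
    usage_events: list of app IDs, one entry per recorded use.
    """
    counts = Counter(usage_events)
    total = sum(counts.values())
    features = {}
    for app in installed:
        if weighted:
            # Share of all usage events; 0.0 for apps installed but never used.
            features[app] = counts[app] / total if total else 0.0
        else:
            # Binary encoding: the app is simply present.
            features[app] = 1
    return features

installed = ["app_a", "app_b", "app_c"]
usage = ["app_a", "app_a", "app_a", "app_b"]

binary = app_features(installed, usage)
weights = app_features(installed, usage, weighted=True)
```

In the binary view all three apps look the same, while the weighted view shows that `app_c` is installed but never used, the kind of signal the questions above are after.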

Lu Yao's eyes lit up as she explained. One move in the contest was so imaginative that she had never thought of it in her own project work!

Once you have a prediction, can it be fed back into the model as a feature? Take the prediction of age and gender groups: gender is relatively easy to predict and its accuracy is higher, so can the predicted gender be fed back as a feature to improve the age prediction? Age accuracy is lower overall, but some particular age ranges have more obvious characteristics. If these are found and fed back into the model, can they improve the overall result?
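A minimal sketch of this feedback idea follows. It is not the contestants' actual method: the stage-one "model" is a toy that predicts gender from the share of males seen per phone brand, and the data is invented. The one detail worth noting is that the predictions are made out of fold, so that each row's predicted gender comes from a model that never saw that row's label.

```python
from collections import defaultdict

def fit_rate_by_key(keys, labels):
    """Toy stage-one model: share of positive labels seen for each key."""
    pos, cnt = defaultdict(int), defaultdict(int)
    for k, y in zip(keys, labels):
        pos[k] += y
        cnt[k] += 1
    overall = sum(labels) / len(labels)
    return {k: pos[k] / cnt[k] for k in cnt}, overall

def out_of_fold_predictions(keys, labels, n_folds=3):
    """Predict each row with a model that never saw that row's label."""
    preds = [0.0] * len(keys)
    for fold in range(n_folds):
        train = [i for i in range(len(keys)) if i % n_folds != fold]
        held_out = [i for i in range(len(keys)) if i % n_folds == fold]
        rates, overall = fit_rate_by_key(
            [keys[i] for i in train], [labels[i] for i in train]
        )
        for i in held_out:
            # Fall back to the overall rate for keys unseen in training.
            preds[i] = rates.get(keys[i], overall)
    return preds

# Hypothetical rows: phone brand, and whether the user is male (1) or not (0).
brands = ["A", "A", "B", "B", "A", "B", "A", "B", "A"]
is_male = [1, 1, 0, 0, 1, 0, 1, 1, 1]
gender_pred = out_of_fold_predictions(brands, is_male)

# Stage two would train the age model on the original features plus this column.
stage_two_rows = [{"brand": b, "pred_male": p} for b, p in zip(brands, gender_pred)]
```

Without the out-of-fold step, the new column would leak each row's own label into stage two and the apparent improvement would not survive on unseen data.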

After feature engineering comes model tuning, which is another test of a data scientist's experience and skill. When tuning a single model, what should the initial parameters be? Random values or specific values? The choice can have a big impact on the speed of convergence.

Model ensembling involves even more know-how. Take a neural network as an analogy: designing one requires deciding how many layers there are and how many neurons each layer has, and model ensembling calls for similar thinking. The models are arranged in several layers. Which models sit in parallel, and which are in series? If they are in series, what should be passed on to the next layer: the direct output of the previous model, its error, or something else?
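The simplest parallel arrangement is a weighted average of several models' probability vectors. The sketch below shows only that case; the two "models", their outputs and the equal weights are invented for illustration, and a series arrangement would instead pass one model's outputs to another model as inputs.

```python
def blend(prob_lists, weights):
    """Weighted average of several models' probability vectors for one user."""
    total_w = sum(weights)
    n_groups = len(prob_lists[0])
    return [
        sum(w * probs[g] for w, probs in zip(weights, prob_lists)) / total_w
        for g in range(n_groups)
    ]

# Two hypothetical models scoring the same user over three groups.
model_a = [0.7, 0.2, 0.1]
model_b = [0.5, 0.3, 0.2]

mixed = blend([model_a, model_b], weights=[1.0, 1.0])
```

Because each input sums to 1 and the weights are normalized, the blend is itself a valid probability vector. How to choose the weights is exactly the kind of judgment the paragraph above describes.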

With a good model and good features, you submit your results and land near the top of the rankings, right?

Well, not necessarily.

There is also a big enemy called overfitting.

An overfitted method performs very well on one particular dataset, but when the dataset changes slightly, the model's performance drops rapidly.

The contest data is divided into a training set and a test set. The training set gives players all the information, including each user's group, and players then guess the groups of the users in the test set.

After submitting results, players can see the public leaderboard, but only administrators can see the private leaderboard. The result of the contest is determined by the private leaderboard. The public leaderboard is for reference only.

The public leaderboard is scored on only 1/4 to 1/3 of the test data, and Kaggle does not limit the number of submissions, so a player who keeps tuning against the public score can end up overfitting to that subset. If you rank near the top of the public leaderboard, you may still rank badly on the private one.

How do you solve this problem? Kaggle's veterans will tell you: always do cross-validation!!!! Write that down in your little notebook!!
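For readers writing this down, here is a minimal k-fold cross-validation sketch. The "model" is a deliberately trivial one that predicts the smoothed share of each group in the training labels, and the labels are synthetic. The point is only the loop structure: train on k-1 folds, score on the held-out fold, and average.

```python
import math
import random

def k_fold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def fit_prior(labels, n_groups):
    """Toy model: smoothed share of each group in the training labels."""
    counts = [1.0] * n_groups  # add-one smoothing so no group gets probability 0
    for y in labels:
        counts[y] += 1
    total = sum(counts)
    return [c / total for c in counts]

def cross_validated_log_loss(labels, n_groups, n_folds=5):
    """Average held-out log loss over all folds, plus the per-fold scores."""
    scores = []
    for held_out in k_fold_indices(len(labels), n_folds):
        held = set(held_out)
        train_labels = [y for i, y in enumerate(labels) if i not in held]
        probs = fit_prior(train_labels, n_groups)
        loss = -sum(math.log(probs[labels[i]]) for i in held_out) / len(held_out)
        scores.append(loss)
    return sum(scores) / len(scores), scores

# Synthetic labels: 30 users spread evenly over 3 hypothetical groups.
labels = [i % 3 for i in range(30)]
mean_loss, fold_losses = cross_validated_log_loss(labels, n_groups=3)
```

Every row is scored exactly once by a model that did not train on it, so the averaged score is a more trustworthy guide than a ranking on a fraction of the test set.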

So far, although the contest has not officially announced the winners, the private leaderboard has already been published on the contest's Kaggle homepage! Let's take a look.

There are a few interesting points about this contest, so let's take a quick look!

1. TalkingData Chief Data Scientist Zhang Xiatian told Lei Feng Net that although the data volume is relatively small, only about 200,000 records, it is "sparse" data: because it is real data that has been desensitized, some records may have incomplete dimensions. Compared with contests whose data comes with complete labels, this one is harder, and more fun.

2. For the first time, the Kaggle community saw a contest from China, hosted by TalkingData. The top experts were very interested, and the contest even drew out the so-called “ancient beasts” of the Kaggle rankings: veterans who have been on Kaggle for a long time and who, having lost interest, rarely appear on the platform anymore. Seven of the top 10 players in the Kaggle rankings took part in the TalkingData competition, as did 14 of the top 20. Yes, China is a mysterious country...

3. In the end, contestants from more than 70 countries submitted results. The country with the most contestants was not China but the United States. And the second? Still not China: it was India. China, including Taiwan and Hong Kong, came third, Russia fourth, and the United Kingdom fifth.

4. In the contest's community forum on Kaggle, because the data comes from China, many players needed to discuss China's national conditions. The person most eager to explain those conditions to everyone was a Frenchman...

