Schools are segregated by educational outcomes in the digital space

The Internet provides students with a unique opportunity to connect and maintain social ties with peers from other schools, irrespective of how far they are from each other. However, little is known about the real structure of such online relationships. In this paper, we investigate the structure of interschool friendship on a popular social networking site. We use data from 36,951 students from 590 schools of a large European city. We find that the probability of a friendship tie between students from neighboring schools is high and that it decreases with the distance between schools following a power law. We also find that students are more likely to be connected if the educational outcomes of their schools are similar. We show that this fact is not a consequence of residential segregation. While high- and low-performing schools are evenly distributed across the city, this is not the case for the digital space, where schools turn out to be segregated by educational outcomes. There is no significant correlation between the educational outcomes of a school and those of its geographical neighbors; however, there is a strong correlation between the educational outcomes of a school and those of its digital neighbors. These results challenge the common assumption that the Internet is a borderless space, and may have important implications for the understanding of educational inequality in the digital age.


Introduction
The Internet creates unique opportunities for people to connect with each other. It may, therefore, be highly beneficial for its users because social ties are known to play a significant role in human well-being, including life satisfaction [1], health [2,3], and professional development [4,5]. There is growing evidence that these findings apply not only to offline social ties but to online friendship as well [6,7]. This role of the Internet may be particularly important for underprivileged groups of people, such as students from low-performing schools who lack resources in their immediate environment. Connections with students from high-performing schools might potentially influence their university aspirations [8], improve educational outcomes [9], and promote positive behavioral change [10].
People from underprivileged backgrounds tend not to benefit from the Internet as much as their peers (a phenomenon usually referred to as digital inequality [11]). While well-educated people often use the Internet for medical or legal advice, job seeking, or education, their less educated peers use it predominantly for entertainment [12-14]. The use of social media by students is known to be differentiated in a similar way depending on their academic performance. High-performing students use it for information seeking, while low-performing students use it for chatting and entertainment [15,16]. It may be expected that online social ties would also depend on academic achievements and that students might be segregated by educational outcomes in the digital space. At a general level, segregation is the degree to which several groups of people are separated from each other [17]. In this paper, we investigate whether students from high- and low-performing schools are separated (i.e. not connected via online friendship) in the digital space.
We use data from 36,951 15-year-old students from 590 schools of Saint Petersburg, Russia, registered on the popular social networking site VK (http://vk.com) (see Methods for details about the sample). VK is the Russian analog of Facebook and the largest European social networking site. It is ubiquitous among young Russians: more than 90% of 18-24-year-olds use it regularly [18]. The information in users' public profiles includes their age and the schools they are studying in. This information is available via the open application programming interface (API) of VK. We use the VK API to download information about all students who indicate that they study in one of Saint Petersburg's schools and who were born in 2001 (i.e. students who were 15 years old at the time of data collection).
Similar to other social networking sites, users might become "friends" on VK if they mutually confirm this status. We use information about such online friendships to construct a weighted network of schools (Fig 1), where two schools are connected if there is at least one friendship tie between their students (see Methods for details), and the weight corresponds to the number of such ties. For each school, the information about its geographical coordinates along with the performance of its graduates on the unified state examination (USE) is available (see Methods). The USE scores serve as a proxy for schools' educational outcomes.
Residential segregation by income is believed to be an important source of variation in schools' educational outcomes in some countries [19-21]. It means that low-performing schools are concentrated in less affluent neighborhoods and the educational outcomes of a school could be effectively predicted from the socioeconomic status of its district [22]. The situation might be different in Saint Petersburg thanks to the egalitarian nature of the Russian educational system inherited from the Soviet period. To account for potential effects of residential segregation, we collect data on 11,034 apartments from the largest Russian real estate site CIAN (http://cian.ru) and use the average apartment price as a proxy for neighborhood affluence. We then check whether schools' educational outcomes are correlated with the affluence of their neighborhood.
We measure geographical segregation of schools as a correlation between the educational outcomes of a school and those of its closest geographical neighbors. We then compare this segregation with that in the digital space. In this case, instead of the closest geographical neighbors, we examine the educational outcomes of schools' closest digital neighbors. We assume that the distance between two schools in the digital space is inversely proportional to the number of online friendship ties between them.
The probability of an online friendship between two people is known to be strongly dependent on the geographical distance between them [23-26]. It is, therefore, important to ensure that any observed effect for the digital network of schools is not solely driven by the geographical constraints. To achieve this, we use a random graph model that preserves geographical constraints, namely, the probability of a friendship tie between two schools given the geographical distance between them. We then compare the results obtained for such random networks with the observed results for the real network.
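One way such a distance-preserving random graph could be generated is sketched below. This is only an illustration under stated assumptions: the function names (`tie_probability_by_bin`, `random_school_network`), the 1 km distance bins, the haversine distance, and the treatment of ties as unweighted are all choices made for the sketch and are not taken from the paper.

```python
import math
import random

def distance_km(a, b):
    """Great-circle (haversine) distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def tie_probability_by_bin(coords, ties, bin_km=1.0):
    """Empirical P(tie | distance bin), computed over all unordered school pairs.

    coords: list of (lat, lon), one entry per school.
    ties:   set of (k, l) index pairs with k < l that share at least one tie.
    """
    pairs, linked = {}, {}
    n = len(coords)
    for k in range(n):
        for l in range(k + 1, n):
            b = int(distance_km(coords[k], coords[l]) // bin_km)
            pairs[b] = pairs.get(b, 0) + 1
            if (k, l) in ties:
                linked[b] = linked.get(b, 0) + 1
    return {b: linked.get(b, 0) / pairs[b] for b in pairs}

def random_school_network(coords, prob_by_bin, bin_km=1.0, rng=random):
    """Draw every school pair independently with the tie probability of its bin."""
    n = len(coords)
    sampled = set()
    for k in range(n):
        for l in range(k + 1, n):
            b = int(distance_km(coords[k], coords[l]) // bin_km)
            if rng.random() < prob_by_bin.get(b, 0.0):
                sampled.add((k, l))
    return sampled
```

Repeating `random_school_network` many times would give a reference distribution against which a statistic measured on the observed network can be compared.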

Distance and online relationships
We find that geographical distance plays an important role in the formation of an interschool friendship. The probability of a friendship tie between two close schools is high (0.75), but it declines rapidly with distance following a power law (Fig 2). The best fit is provided by the exponent −0.62 (Fig 2 inset), which is similar to previously observed results [26].
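To illustrate what estimating such an exponent involves, the sketch below fits p(d) = c · d^α by ordinary least squares on log-transformed values. The fitting procedure actually used for Fig 2 is not described here, so this is an assumed approach, and the data points are synthetic values generated for the example rather than the measured probabilities.

```python
import math

def fit_power_law(distances, probabilities):
    """Least-squares fit of log p = log c + alpha * log d; returns (c, alpha)."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(p) for p in probabilities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha

# Synthetic illustration only: points generated from p = 0.75 * d ** -0.62.
distances = [1, 2, 4, 8, 16]
probabilities = [0.75 * d ** -0.62 for d in distances]
c, alpha = fit_power_law(distances, probabilities)
```

On noisy empirical data, a log-log regression of this kind is sensitive to the choice of distance bins, which is one reason to treat it as a rough check rather than a definitive estimator.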

Geographical segregation
We find that the educational outcomes of schools do not depend on their distance from the city center (the Pearson correlation coefficient between the USE scores of schools and their distance from the center is 0.018, P = 0.65). The distance from the center may be, however, a poor proxy for neighborhood affluence. Hence, we additionally collect information about apartment prices across the city. We use the average apartment price in the area where schools are located as a proxy for their neighborhood affluence. We then compute the correlation between schools' USE scores and neighborhood affluence, $S_n(R)$ (see Methods). The exact value depends on $R$ (see S1 Fig), and the maximum value is $S_n = 0.12$ (P = 0.007), indicating a weak correlation between educational outcomes and neighborhood affluence. Finally, we compute the correlation between the USE scores of schools and the average USE score of their $N$ closest geographical neighbors, $S_g(N)$ (see Methods). We find no correlation, $S_g(N) = 0.01$ (P = 0.73) for $N = 20$ (Fig 3a); this result holds true for all values of $N$ (S2 Fig). We, therefore, find that there is only a weak relationship, if any, between the educational outcomes of a school and its location in physical space. However, as we show in the next section, this result does not apply to the school's location in the digital space.

Digital segregation
We find that there is a relatively strong correlation between the educational outcomes of schools and their $N$ closest digital neighbors (see Methods): $S_d(N) = 0.47$ ($P < 10^{-33}$) for $N = 20$ (Fig 3b). The correlation is significant for all $N$ (S2 Fig). To rule out the role of geographical constraints in the observed digital segregation, we use a random graph model that preserves the relationship between distance and the probability of a friendship tie from the observed network (i.e. we create a tie between two schools with a probability, taken from the distribution represented in Fig 2, that corresponds to the geographical distance between them).
We also find that high-performing schools not only tend to be connected with each other but also have more connections on average than low-performing schools. The correlation between the degree centrality of schools in the network and their educational outcomes is 0.49 (S3 Fig). This correlation might be partially explained by the presence of high-performing selective schools that attract students from all over the city (see S1 Text for details).
One of the strongest predictors of academic achievement is the socioeconomic status of students [27]. This is true not only at the individual level but also at the school level, i.e. the socioeconomic composition of the student body is a strong predictor of a school's educational outcomes. For Russian schools, 34%-41% of the variance in average USE scores is explained by the socioeconomic composition of the student body, the same amount that is explained by a school's material and human resources [28]. It is, therefore, noteworthy that the degree centrality is such a strong predictor of educational outcomes. Note that this is a simple network property and that it does not contain any information about schools or students themselves.
We show, therefore, that the educational outcomes of a school are closely related to its location in the digital space. More central schools tend to be high performing. We also show that schools with similar academic performance tend to be connected in the digital space. We demonstrate that these results cannot be explained by schools' locations in the physical space.

Discussion
Both for research and policy-making purposes, it is crucial to understand the context in which schools operate. This requirement traditionally means collecting information about school resources and the socioeconomic status of its students. Today, students spend much of their time online [29], and it may be warranted to consider students' online environment on a par with their home environment. In this paper, we focus on only one dimension of such an online environment, namely interschool friendship on a social networking site. We find that a school's position in an online friendship network could explain as much variation in the educational outcomes of its students as their socioeconomic status, indicating the importance of the digital context. Online inequalities might merely reflect existing socioeconomic inequality or might instead complement it. In particular, it is not known whether students from different schools who are friends on VK know each other offline or whether these connections are purely virtual. Future research is required to clarify this relationship.
Social media have become the main source of information for young people. In Russia, VK is referred to as the main source of information about the country and the world by 70.3% of respondents, more than any other information source [30]. It is also considered more trustworthy than traditional media [30]. The news feed of the social network mainly comprises posts shared by online friends. Friends from different schools may, therefore, be an important source of diversity in the information environment of students. In particular, connections with students from high-performing schools could have a positive impact on students from low-performing schools. However, our results suggest that interschool friendship ties mainly exist between schools with similar educational outcomes. Intriguingly, this digital separation cannot be explained by the geographical location of schools. This result means that the digital environment not only fails to remove segregation but might even amplify it.

Data collection
According to the open data government portal (http://data.gov.spb.ru), there are 638 high schools in Saint Petersburg. This number excludes specific types of schools such as boarding schools, cadet schools, and educational centers. We use the open VK API to find these schools in the VK database. We find VK IDs for 628 of the schools. We exclude school №1 from the sample because it has an implausibly large number of users (more than 1000 per cohort). We also exclude two pairs of schools with identical names. We then use data from the web portal "Schools of Saint Petersburg" (http://www.shkola-spb.ru) to obtain the average performance of schools' graduates at the Unified State Examination. This is a mandatory state examination that all school graduates must pass in Russia. This information was available for 590 schools from our sample.
We then query the VK API to obtain the lists of all users who were born in 2001 and indicate that they are studying in one of the schools from our sample. To exclude users who provided false information about their school, we remove profiles with no friends from the same school, as previously recommended [31]. We also exclude students who indicate several schools in their profiles. Finally, we download the lists of all VK friends for users from our sample. All collected data are publicly available. The VK team confirmed to us that we can use its API in this way for research purposes.
We also use data from the largest Russian real estate site CIAN to collect information about the prices of all 2-room apartments in Saint Petersburg listed on the site. For each apartment, we calculate its price per square meter. The CIAN team approved the use of these data for research purposes.
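The two profile filters described above can be sketched as follows. The data layout (plain dictionaries), the function name `filter_students`, and the order in which the filters are applied are assumptions made for this illustration; in particular, the sketch counts a friend as being from the same school only if that friend lists exactly one school.

```python
def filter_students(schools_by_user, friends_by_user):
    """Keep users who list exactly one school and have a friend from that school.

    schools_by_user: dict mapping user id -> list of school ids in the profile.
    friends_by_user: dict mapping user id -> iterable of friend user ids.
    Returns a dict mapping each retained user id to its single school id.
    """
    # Drop profiles that indicate several schools (or none at all).
    single = {u: s[0] for u, s in schools_by_user.items() if len(s) == 1}
    kept = {}
    for user, school in single.items():
        # Drop profiles with no friends from the same school.
        if any(single.get(f) == school for f in friends_by_user.get(user, ())):
            kept[user] = school
    return kept
```

For example, a user who lists school 2 but whose only friends attend school 1 would be removed, as would any user who lists two schools.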

Network of schools
We define a 36,951 × 36,951 adjacency matrix $F$ that represents the friendship network of students (i.e. $F_{i,j} = 1$ if students $i$ and $j$ are friends on VK and $F_{i,j} = 0$ otherwise). We assume that student $i$ studies in school $s(i)$, and construct a weighted network of schools by counting the number of all friendship ties between two schools. This network is represented by a 590 × 590 matrix $A$, where

$$A_{k,l} = \sum_{\{i,j \mid s(i)=k,\ s(j)=l\}} F_{i,j}.$$

One potential disadvantage of this definition is that two schools could be considered closely connected when only one student from the first school has many friends from the other. We therefore also use an alternative way to define the weight of the school tie. In this case, instead of friendship ties, we count the number of students from one school who have friends from another, i.e. we define

$$\tilde{A}_{k,l} = \left|\{i \mid s(i) = k \text{ and } \exists j : F_{i,j} = 1,\ s(j) = l\}\right|.$$

We could then construct a symmetric matrix $\hat{A}_{k,l} = \min(\tilde{A}_{k,l}, \tilde{A}_{l,k})$. This alternative metric leads to the same results, and therefore we opted for the first, more straightforward approach.
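The two tie-weight definitions above can be illustrated with a short sketch. The function names, the edge-list input (each undirected friendship listed once), and the choice to ignore ties within a single school are assumptions of the sketch, not details given in the text.

```python
from collections import defaultdict

def school_tie_counts(friend_pairs, school_of):
    """A[(k, l)]: number of friendship ties between students of schools k and l."""
    A = defaultdict(int)
    for i, j in friend_pairs:  # each undirected friendship is listed once
        k, l = school_of[i], school_of[j]
        if k != l:  # assumption: within-school ties are not part of the network
            A[(k, l)] += 1
            A[(l, k)] += 1
    return dict(A)

def students_with_friends_in(friend_pairs, school_of):
    """A_tilde[(k, l)]: number of students of school k with a friend in school l."""
    seen = set()  # (student, other school) combinations, counted once each
    for i, j in friend_pairs:
        k, l = school_of[i], school_of[j]
        if k != l:
            seen.add((i, l))
            seen.add((j, k))
    A_tilde = defaultdict(int)
    for student, other_school in seen:
        A_tilde[(school_of[student], other_school)] += 1
    return dict(A_tilde)

def symmetrize_min(A_tilde):
    """A_hat[(k, l)] = min(A_tilde[(k, l)], A_tilde[(l, k)])."""
    return {(k, l): min(v, A_tilde.get((l, k), 0))
            for (k, l), v in A_tilde.items()}
```

With one very sociable student in school 1 who has three friends in school 2, the first weight grows by three while the alternative weight counts that student only once, which is the difference the alternative definition is meant to capture.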

Segregation measures
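If $U_i$ is the average performance on the Unified State Examination of graduates from school $i$, we can define segregation based on the affluence of school neighborhoods in the following way:

$$S_n(R) = \mathrm{corr}\left(U_i,\ \langle P_j \rangle_{\{j \mid d(i,j) \le R\}}\right),$$

where $P_j$ is the price of apartment $j$ in rubles per square meter, $d(i, j)$ is the distance between school $i$ and apartment $j$, and the angle brackets denote the average over all apartments located within distance $R$ of school $i$.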
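A minimal sketch of this neighborhood measure, read as the Pearson correlation between each school's USE score and the mean price of apartments within a radius R of the school, is given below. The planar coordinates, the inclusive radius, the handling of schools with no nearby listings, and the function names are assumptions made for the illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def neighborhood_segregation(schools, apartments, radius):
    """Correlation between school scores and mean nearby apartment price.

    schools:    list of (x, y, U) with planar coordinates and USE score U.
    apartments: list of (x, y, P) with planar coordinates and price P per m^2.
    radius:     R, in the same units as the coordinates.
    """
    scores, mean_prices = [], []
    for sx, sy, score in schools:
        nearby = [p for ax, ay, p in apartments
                  if math.hypot(ax - sx, ay - sy) <= radius]
        if nearby:  # assumption: schools with no listing within R are skipped
            scores.append(score)
            mean_prices.append(sum(nearby) / len(nearby))
    return pearson(scores, mean_prices)
```

Evaluating the function over a range of radii would show how the correlation depends on R.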
We denote geographical neighbors of school $i$ by $N_g(i)$. $N_g(i) = (s_{i,1}, \ldots, s_{i,590})$ is an ordered list of all schools such that $\hat{d}(i, s_{i,k}) \le \hat{d}(i, s_{i,k+1})$, where $\hat{d}$ is the geographical distance between schools. We then denote the list of $k$ closest geographical neighbors by $N_g^k(i) = (s_{i,1}, \ldots, s_{i,k})$. We define the $k$ closest digital neighbors $N_d^k(i)$ by replacing the geographical distance with the digital distance, which is equal to $1/A_{i,j}$.
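We then define geographical and digital segregation by academic performance in the following manner:

$$S_g(k) = \mathrm{corr}\left(U_i,\ \frac{1}{k} \sum_{j \in N_g^k(i)} U_j\right), \qquad S_d(k) = \mathrm{corr}\left(U_i,\ \frac{1}{k} \sum_{j \in N_d^k(i)} U_j\right),$$

where $k$ is the number of neighbors considered (denoted $N$ in the sections above). Note that in the case of digital segregation, there could be several schools at exactly the same distance from a given school. In this case, $N_d^k(i)$ is not uniquely defined. In our computations, we randomly select, with equal probabilities, one of the possible $N_d^k(i)$.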
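The neighbor-based segregation measures, including the random choice among equally distant schools, could be computed along the following lines. This is a sketch under assumptions: the function names and dictionary layout are hypothetical, and pairs of schools with no ties are given an infinite digital distance, which the text does not specify.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def digital_distance(A, schools):
    """distance[i][j] = 1 / A[(i, j)]; infinite if the schools share no ties."""
    return {i: {j: (1 / A[(i, j)] if A.get((i, j)) else math.inf)
                for j in schools if j != i}
            for i in schools}

def k_nearest(i, distance_to, k, rng):
    """The k closest schools to i, with equal distances resolved at random."""
    others = [j for j in distance_to if j != i]
    rng.shuffle(others)                        # random order first,
    others.sort(key=distance_to.__getitem__)   # which a stable sort keeps within ties
    return others[:k]

def segregation(scores, distance, k, rng=None):
    """Correlation of each school's score with the mean score of its k neighbors.

    scores:   dict mapping school id -> average USE score.
    distance: dict of dicts, distance[i][j] between schools i and j
              (geographical for S_g, digital for S_d).
    """
    rng = rng or random.Random()
    ids = list(scores)
    neighbor_means = []
    for i in ids:
        nearest = k_nearest(i, distance[i], k, rng)
        neighbor_means.append(sum(scores[j] for j in nearest) / len(nearest))
    return pearson([scores[i] for i in ids], neighbor_means)
```

Shuffling before a stable sort means that when several schools are tied at the cutoff distance, each admissible set of neighbors is equally likely to be chosen.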

Ethical considerations
The data was collected as part of the "Digital Trace" project that was approved by the Institutional Review Board of the National Research University Higher School of Economics. Note that the units of our analysis are schools rather than individuals. Public information about friendship ties between users who indicated their high schools on VK was used to construct a friendship network between schools. Neither names of users nor other personal information available from VK were analyzed or collected as part of this research.