Converging Work-Talk Patterns in Online Task-Oriented Communities

Much of what we do is accomplished by working collaboratively with others, and a large portion of our lives is spent working and talking; the patterns embodied in the alternation of working and talking can provide much useful insight into task-oriented social behaviors. The available electronic traces of the different kinds of human activities in online communities are an empirical goldmine that can enable the holistic study and understanding of these social systems. Open Source Software (OSS) projects are prototypical examples of collaborative, task-oriented communities, depending on volunteers for high-quality work. Here, we use sequence analysis methods to identify the work-talk patterns of software developers in online communities of Open Source Software projects. We find that software developers prefer to persist in the same kind of activity, i.e., a string of work activities followed by a string of talk activities and so forth, rather than switch between them frequently; this tendency strengthens with time, suggesting that developers become more efficient and can work longer with fewer interruptions. This process is accompanied by the formation of community culture: the patterns of developers in the same community get closer with time, while different communities become relatively more different. The emergence of community culture is apparently driven by both “talk” and “work”. Finally, we also find that workers with a good balance between “work” and “talk” tend to produce just as much work as those who focus strongly on “work”; however, the former appear more likely to remain active contributors in their communities.


Introduction
A great deal of adult life is spent working. We work to create materials that fulfill human needs, to develop advanced technologies, to govern, heal, and teach each other, etc. Our work is often collaborative, and often involves repeated activities: we commute, work, collaborate with others, etc. Collaborations involve both talking and working. We get some work done, talk with our colleagues to socialize, learn, or further coordinate tasks, and then work some more. These recurrent practices constitute patterns of activities that can be used to characterize individuals, cluster them, and then predict their future behaviors; this has potential applications in various areas including crime control [35], traffic forecasting [33], and marketing [10]. In this paper, we focus on the two most basic activities, i.e., work and talk. Note that talk, or communication, is an important marker of human relations and plays a key role in the coordination between pairs of co-operating individuals. As a result, it is commonly used to infer social networks, which then serve as the discrete spaces in which the dynamics of many other activities are studied [40,41]. Here, however, we treat work and talk activities equally, and use sequence analysis methods to reveal the work-talk (W-T) patterns of individuals in online task-oriented communities.
Sequence analysis, which has a long history of use in molecular biology [53], has recently also been used in social science [1,13], where researchers investigate life courses [5] and career trajectories [2]. Whereas DNA sequences are curled up in three-dimensional space, social events are arranged according to their time of occurrence. Because we are interested in social phenomena that are mostly local in time, the positions of social events in a sequence refer to relative, rather than absolute, time points. In bioinformatics, a number of global and local sequence alignment methods are used to compare the genetic similarity of molecules within and across different organisms, so as to elucidate their biological functions [45,47]. Here we adopt a local alignment method to find and enumerate short patterns in the work-talk (W-T) sequences of different software developers. We use these short W-T pattern counts as data points for modeling developer behavior using hidden Markov models (HMMs) [21]. The goodness of fit of these models is established via their ability to predict the numbers of larger patterns in the sequences [47].
In collaborative projects the interplay of work and talk activities plays a central role; therefore, the fitted parameters of these W-T HMMs can be used not just to characterize the W-T patterns of different individuals, but also to help describe the projects' work culture. By "work culture" here we mean the tendency of a group of individuals to share similar W-T patterns. The simplest such distinguishing pattern is that they tend either to work continuously (thus creating shared work products) or to talk continuously (coordinating work with others, and strengthening relationships). More complex patterns are a combination of the two. This connotation of "culture" is consistent with Etzioni's notion [23]: "the set of assumptions shared by members of a societal unit which sets a context for its view of the world and itself". It is known that community culture plays an important role in innovation [30] and in the quality of work products [6], and can facilitate decision-making [16]. Community size and its evolution are related to productivity and efficiency. Recent work on collaboration indicates that team size [55] and team assembly mechanisms [26] have significant effects on team performance.
We used open-source software (OSS) communities [20,39,54] to study W-T patterns for three main reasons. First, the work in OSS communities is easy to observe, and most of the talk activities are meaningfully related (because of community norms) to work activities; this simplifies the observation of functional W-T patterns. Second, the work and talk activities in OSS communities are always archived [42], so they are readily collected for analysis. Finally, performance properties, such as productivity in terms of the number of lines of code written, can also be measured using the state of the produced software. Communication is vital in making collaboration effective [57]. Lack of communication between software developers may introduce more coordination problems [28], and thus distributed developers need to maintain awareness of one another to make OSS projects successful [27]. Recent empirical research indicates that communication and commits may accelerate each other in Apache OSS projects [58], and similarly in Stack Overflow and GitHub [50]. However, it is also argued that, since communication and work activities may compete for the time of individuals, communication may have some negative effect on the efficiency of work activities [38,59]. In this case, communication can be considered a kind of interruption, from which developers may need some time to recover before continuing their work [36,49].
In this paper, by studying W-T patterns of developers in OSS communities, we make the following main contributions.
• We establish HMMs for the W-T sequences of developers and find evidence of community culture: developers in the same community tend to have more similar W-T patterns than those from different communities, and this pattern affinity strengthens with time.
• We observed that developers who have balanced W-T patterns are just as productive as those who work more continuously (fewer interruptions), but the former tend to stay active longer; this suggests that W-T balance is important to sustain OSS communities.
• We create social and cooperation networks, and find that the convergence of W-T patterns between a given pair of developers appears to be reinforced by both talk- and work-related interactions. This indicates that the emergence of a community W-T culture is apparently driven by both "work" and "talk", and thus offers a novel perspective on the co-evolving mechanisms of socio-technical, interdependent networks [12,14,56].

Work-Talk Activities in OSS
Human activities can be represented by time-series denoting the event occurrences. As a generalization, different kinds of activities can be represented by an asynchronous multiple time-series [37], since people seldom do different things at exactly the same time; this is quite different from the multiple time-series in economics, where different kinds of indexes are recorded at the same time for comparison. Since work and talk are the activities of concern here, we use sequence analysis to study these two kinds of activities in OSS communities, with primary focus on their relative time order, without concern for the precise time of occurrence.

Data Description
We collected the individual work and talk activities from 31 OSS developer communities in the Apache Software Foundation on March 24th, 2012. In each community, there are several volunteer developers who contribute by committing to files, i.e., adding or removing software code; these activities are recorded in a Git repository. These are our "work events", or W. Members of the online communities communicate with the rest of the community by email, through the developer mailing lists, to share programming knowledge and coordinate with others. We record the sent emails as the "talk events", or T, of a developer (the received emails are included in the talk activities of others). Using this data, a W-T sequence of work and talk activities can be recorded for each developer. Note that messages may be automatically posted to a mailing list in an OSS community to inform others when some work is done. In order to exclude such trivial talk activities, we consider only response emails [58], which make up about 73% of all messages [8]. Moreover, we use a semi-automatic approach to resolve multiple aliases of the same person [8].
We pre-process the W-T sequence data in several ways. To ensure a sufficient number of samples to reliably compare the W-T patterns between pairs of developers from the same or from different communities, we select the subset of developers whose sequences include at least 500 work and talk activities, and the subset of communities with at least 5 such developers. We acknowledge a risk of left-censoring of both work and talk activities: if any OSS community did not archive its emails, or used a different version control system before moving to Git, some early data could be lost. Besides, it is known that many individuals need to first earn social capital in an OSS project by communicating with others before they are accepted as developers [9,25]. As a result, we often observe long prefixes of pure talk activities at the beginning of a sequence; prefixes of pure work or talk activities are removed in pre-processing.

Figure 1: A W-T time-series and the corresponding W-T sequence.

After pre-processing the data, we are left with 14 communities containing 120 developers in total; several of their basic properties are presented in Table 1. Note that besides the developers, we also list the number of active users (including developers) in each project. These users might not commit to files directly, but they may contribute to the communities in other ways, such as by reporting bugs.
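The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, timestamps are assumed to be available as `datetime` objects, and "removing a pure prefix" is interpreted here as dropping the leading run of identical symbols, which is an assumption.

```python
from datetime import datetime

def build_wt_sequence(commit_times, email_times):
    """Merge commit ('W') and sent-email ('T') timestamps into one W-T string.

    Only the relative order of events is kept, not the timestamps themselves.
    """
    events = [(t, "W") for t in commit_times] + [(t, "T") for t in email_times]
    events.sort(key=lambda event: event[0])
    return "".join(label for _, label in events)

def strip_pure_prefix(seq):
    """Drop the leading run of identical symbols (assumed meaning of 'pure prefix')."""
    if not seq:
        return seq
    first = seq[0]
    cut = 0
    while cut < len(seq) and seq[cut] == first:
        cut += 1
    return seq[cut:]

def select_developers(sequences, min_activities=500):
    """Keep developers whose W-T sequence has at least `min_activities` events."""
    return {dev: seq for dev, seq in sequences.items() if len(seq) >= min_activities}

# Toy data (hypothetical timestamps, for illustration only).
commits = [datetime(2012, 1, 3), datetime(2012, 1, 5)]
emails = [datetime(2012, 1, 1), datetime(2012, 1, 2), datetime(2012, 1, 4)]
sequence = build_wt_sequence(commits, emails)
trimmed = strip_pure_prefix(sequence)
```

Here `sequence` is `"TTWTW"` and `trimmed` is `"WTW"`; the community-level filter (at least 5 selected developers) would be applied on top of `select_developers`.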

Pattern Analysis
A G-pattern in a sequence over the alphabet {W, T} is a contiguous subsequence of length $G$. There are in total $2^G$ possible different G-patterns. Typically, the length of the patterns is much shorter than the length of the given sequence. In our study we focus on 2-patterns and 3-patterns. Given a sequence $\theta = \{s_1, s_2, \ldots, s_h\}$ over {W, T}, we count the occurrences of each of the $2^G$ patterns by rolling a window of size $G$ over the sequence and incrementing the count for the pattern we find. E.g., in the W-T sequence shown in Figure 1, the four possible 2-patterns, WW, WT, TW, and TT, occur eight, five, five, and six times, respectively.
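The rolling-window count described above can be sketched in a few lines. The function name `count_patterns` is hypothetical, and the W-T sequence is assumed to be stored as a Python string of `W` and `T` characters.

```python
from itertools import product

def count_patterns(seq, g):
    """Count every G-pattern over {W, T} by sliding a window of size g over seq."""
    # Start all 2**g patterns at zero so that absent patterns are reported too.
    counts = {"".join(p): 0 for p in product("WT", repeat=g)}
    for i in range(len(seq) - g + 1):
        counts[seq[i:i + g]] += 1
    return counts

two_patterns = count_patterns("WWTTW", 2)
three_patterns = count_patterns("WWTTW", 3)
```

A sequence of length h contains h − G + 1 windows, so the counts always sum to that number.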
To assess the probability that a pattern occurs by chance, we create a null (baseline) model by randomizing the observed W-T sequence so as to preserve the proportion of work to talk activities. This can be achieved, e.g., by applying the sample() function of R [48] to the sequence indexes. Then, the preference for pattern $P$ in the observed sequence, $\theta$, over the randomized sequence, $\theta^*$, is calculated as the relative difference between the counts for that pattern, $C_P$ and $C_P^*$, in the respective sequences,

$$\lambda_P = \frac{C_P - C_P^*}{C_P^*}.$$

For each pattern $P$ in a sequence, we also calculate its Z-score [32] as $\lambda_P C_P^* / \varsigma$, where $\varsigma$ is the standard deviation of the pattern counts in $\theta^*$. For $C_P^*$, we generated 100 randomized sequences for each observed one; the absolute values of the Z-scores for 2-patterns in our study are larger than 5 in 462 out of 480 cases, indicating that the observed counts are surprising. These random sequences are also used as references to evaluate the HMM (represented in Figure 2) on predicting 3-patterns. (See the Appendix for further mathematical description of the HMM.)

Figure 2: An HMM with two states, i.e., "work" and "talk", denoted by "W" and "T", respectively. The model is used to explain the W-T patterns of developers in different communities.

Given an observed W-T sequence for each person, we count in it the occurrences of two-patterns. Then, we derive the preference for each pattern, denoted by $\lambda_i$, $i = 1, 2, 3, 4$, for WW, WT, TW, and TT, respectively, in the real sequences as compared to random ones (for each real sequence we bootstrapped from it 100 simulated W-T sequences, by randomizing the order of its elements). In our data set, we find that, on average over all developers, $\lambda_1 = 148.9\%$ and $\lambda_4 = 40.5\%$, while $\lambda_2 = -38.0\%$ and $\lambda_3 = -38.6\%$, i.e., WW and TT are enriched, while WT and TW are depleted. This suggests that developers much prefer to persist with one activity type, rather than switch between activities.
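A minimal sketch of the shuffling baseline follows. It uses Python's `random.shuffle` in place of R's `sample()`, and it takes the mean count over the randomized sequences as the baseline count; both choices, and all function names, are assumptions made for illustration.

```python
import random
import statistics

def count_pattern(seq, pattern):
    """Number of (possibly overlapping) occurrences of pattern in seq."""
    g = len(pattern)
    return sum(1 for i in range(len(seq) - g + 1) if seq[i:i + g] == pattern)

def preference_and_zscore(seq, pattern, n_random=100, seed=0):
    """Relative preference and Z-score of a pattern against shuffled copies of seq.

    Shuffling preserves the proportion of W to T. Raises ZeroDivisionError if the
    pattern never occurs in the shuffled copies or if their counts do not vary.
    """
    rng = random.Random(seed)
    observed = count_pattern(seq, pattern)
    symbols = list(seq)
    random_counts = []
    for _ in range(n_random):
        rng.shuffle(symbols)
        random_counts.append(count_pattern("".join(symbols), pattern))
    baseline = statistics.mean(random_counts)   # assumed estimate of the baseline count
    spread = statistics.pstdev(random_counts)   # standard deviation of baseline counts
    preference = (observed - baseline) / baseline
    z_score = (observed - baseline) / spread
    return preference, z_score

# Toy sequence with strong persistence: one long work run, then one long talk run.
toy = "W" * 50 + "T" * 50
pref_ww, z_ww = preference_and_zscore(toy, "WW")
pref_wt, z_wt = preference_and_zscore(toy, "WT")
```

For this toy sequence the WW preference is positive and the WT preference is negative, which mirrors the sign pattern reported in the text.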
It may be argued that two successive activities should not be considered a two-pattern if the time interval between them is relatively long, e.g., longer than one month. To show that our method is robust with respect to time-scale, we also calculate the relative difference while varying the threshold on the time interval over which we consider two-patterns. We vary the threshold, denoted by ξ = 1, 7, 30 (days), and only the patterns with intervals ≤ ξ are considered. The results are shown in Figure 3, where we can see that WW and TT patterns are always much more preferred than WT and TW patterns in the real sequences, under thresholds varying from one day to one month. Interestingly, we also find a slight trend that the WW pattern becomes more preferred, and the TT pattern less preferred, when we exclude more repeated activities with relatively longer time intervals (i.e., when ξ is smaller). Although the number of these long time-interval patterns is relatively small (2.2% and 0.3% for ξ = 7 and ξ = 30, respectively), this slight trend still indicates that developers are more likely to start and end a repeated and relatively compressed work sequence with talk activities, viz., talk activities play an important role in enabling new tasks (work activities) in these online communities.
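The thresholded count can be sketched as below, assuming each event is stored as a pair of a timestamp (in days) and a label; the function name and the toy events are hypothetical.

```python
def count_two_patterns_within(events, max_gap_days):
    """Count two-patterns whose two events are at most max_gap_days apart.

    events: list of (time_in_days, label) pairs, sorted by time, label in {"W", "T"}.
    """
    counts = {"WW": 0, "WT": 0, "TW": 0, "TT": 0}
    for (t0, a), (t1, b) in zip(events, events[1:]):
        if t1 - t0 <= max_gap_days:
            counts[a + b] += 1
    return counts

# Toy events: gaps of 0.5, 2.5, 37 and 1 days between successive activities.
toy_events = [(0.0, "W"), (0.5, "W"), (3.0, "T"), (40.0, "T"), (41.0, "W")]
within_day = count_two_patterns_within(toy_events, 1)
within_week = count_two_patterns_within(toy_events, 7)
within_month = count_two_patterns_within(toy_events, 30)
```

The same preference calculation as before can then be applied to the filtered counts for each threshold.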
In order to study W-T patterns in more detail, we create a two-state HMM with parameters α and β representing the conditional transition probabilities P(W|W) and P(T|T), respectively, for each developer (see the Appendix). Based on the model, the probabilities of the four different two-patterns for each developer, $P_{WW}$, $P_{WT}$, $P_{TW}$, and $P_{TT}$, are determined by α and β, and can be estimated from the counts of the four different two-patterns, i.e., $M_i$, $i = 1, 2, 3, 4$, respectively. The parameters α and β are then obtained from these counts, as long as the corresponding W-T sequence contains enough elements. Here, it is not difficult to prove that the condition $|M_2 - M_3| \le 1$ must be satisfied for any W-T sequence. This HMM is fully determined by the numbers of the four different two-patterns. We validate the model by checking its ability to predict the numbers of larger patterns, e.g., three-patterns. There are eight different three-patterns, WWW, ..., TTT. For each developer, we denote by $M_i$, $M_i^*$, and $M_i^\circ$, $i = 1, 2, \ldots, 8$, the numbers of three-patterns in the real sequence, the random sequence, and the sequence created by the model (with the same length and the same initial element), respectively. Then, for this developer, we can calculate the relative errors introduced by the random mechanism and by the model, i.e., the relative deviations of $M_i^*$ and of $M_i^\circ$ from $M_i$, respectively. For each developer, and for each three-pattern, the difference between these two errors can be used to validate the ability of the HMM to predict that pattern: if the relative error introduced by the model is significantly smaller than that introduced by the random mechanism, it is reasonable to believe that the model predicts that pattern well. The differences between these two kinds of errors for all eight three-patterns are visualized by the box-and-whisker diagrams shown in Figure 4, with all the developers considered together. We can see that the HMM does indeed predict the numbers of all eight three-patterns with significantly smaller relative errors ($p = 1.8 \times 10^{-16}$ on average) than the random mechanism for the developers we studied, i.e., 14.5% versus 67.4% on average.
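One plausible way to make the estimation concrete is sketched below. It is not taken from the paper: it assumes the two-pattern counts are ordered as WW, WT, TW, TT, uses the standard transition-frequency estimates $\hat\alpha = M_1/(M_1+M_2)$ and $\hat\beta = M_4/(M_3+M_4)$, and, instead of simulating a sequence from the model, computes expected three-pattern counts for a stationary chain with start probability $P(W) = (1-\beta)/(2-\alpha-\beta)$. All function names are hypothetical.

```python
def two_pattern_counts(seq):
    """Counts of WW, WT, TW, TT over adjacent pairs of seq."""
    counts = {"WW": 0, "WT": 0, "TW": 0, "TT": 0}
    for a, b in zip(seq, seq[1:]):
        counts[a + b] += 1
    return counts

def fit_alpha_beta(seq):
    """Estimate alpha = P(W|W) and beta = P(T|T) from two-pattern counts.

    Assumes seq contains at least one transition out of W and one out of T.
    """
    m = two_pattern_counts(seq)
    alpha = m["WW"] / (m["WW"] + m["WT"])
    beta = m["TT"] / (m["TT"] + m["TW"])
    return alpha, beta

def expected_three_patterns(n_windows, alpha, beta):
    """Expected three-pattern counts under a stationary two-state chain.

    Assumes alpha + beta < 2, so that the stationary distribution exists.
    """
    stay = {"W": alpha, "T": beta}
    p_w = (1 - beta) / (2 - alpha - beta)
    start = {"W": p_w, "T": 1 - p_w}

    def step(a, b):
        return stay[a] if a == b else 1 - stay[a]

    return {a + b + c: n_windows * start[a] * step(a, b) * step(b, c)
            for a in "WT" for b in "WT" for c in "WT"}

def relative_error(predicted, observed):
    """Relative deviation of a predicted count from a non-zero observed count."""
    return abs(predicted - observed) / observed

toy = "WWWTT" * 10
alpha, beta = fit_alpha_beta(toy)
expected = expected_three_patterns(len(toy) - 2, alpha, beta)
```

For the toy sequence the estimates are 2/3 and 10/19; comparing `expected` with the observed three-pattern counts through `relative_error` gives the kind of model error discussed above.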

Community Culture
Generally, people sharing similar habits and interests are more likely to come together; once together, they may further influence each other, so as to form a community culture. One interesting question is then: do developers in the same OSS community present more similar W-T patterns (or, more specifically, closer α and β) than developers from different communities, i.e., can W-T patterns be used as a metric to characterize community culture?

Converging Work-Talk Patterns and Community Culture
In order to answer the above question, we first visualize all (α, β) pairs in the α − β plane, as shown in Figure 5, where the developers of the same communities are marked by the same symbols.Evidence of clustering is visually apparent: the points representing the developers in the same communities are indeed closer to each other when compared with those from different communities.
We further classified all the developers into three groups by the k-means method [31], and find that most developers in the same community are concentrated in one of the three clusters, rather than uniformly distributed over all three, which indicates different community cultures. More specifically, most developers (≥ 50%) in Derby, Lucene, Mahout, Nutch, and Solr belong to cluster #1, which corresponds to mostly talk activities (high β), while most of the developers in Axis2 c, Camel, Cxf, Ode, Openejb, and Wicket belong to cluster #3, corresponding to mostly work activities (high α). We define the center of a community in the α − β plane as the median of the HMM parameters of its developers, and calculate its diversity as the average distance between the HMM parameters of the developers and the center, as shown in Figure 6 for the above 11 communities. It is interesting to find that the communities sharing similar W-T patterns also belong to similar domains (see the descriptions in Table 1). For example, Lucene, Nutch, and Solr are all about "search" and are intrinsically related to each other, as stated in the introduction of Nutch on its web site: "Stemming from Apache Lucene, Apache Nutch now builds on Apache Solr adding web-specifics". Besides, Axis2 c, Cxf, and Ode are all about "services", while each of Camel, Cxf, and Wicket is a software framework that provides a shared architecture for a class of applications.
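The center and diversity of a community can be sketched as follows. The component-wise median is an assumption about what "the median of the HMM parameters" means, and the function names and toy values are hypothetical.

```python
import math
import statistics

def community_center(params):
    """Component-wise median of the (alpha, beta) pairs of a community."""
    alphas = [a for a, _ in params]
    betas = [b for _, b in params]
    return (statistics.median(alphas), statistics.median(betas))

def community_diversity(params):
    """Average Euclidean distance between each developer and the community center."""
    center = community_center(params)
    return statistics.mean(math.dist(p, center) for p in params)

# Toy (alpha, beta) pairs for one hypothetical community.
toy_params = [(0.2, 0.8), (0.4, 0.6), (0.6, 0.4)]
center = community_center(toy_params)
diversity = community_diversity(toy_params)
```

A small diversity value means that the developers of a community sit close together in the α − β plane.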
More formally, if we denote by $\alpha_i$ and $\beta_i$ the HMM parameters of developer $d_i$, we can calculate the Euclidean distance between the HMM parameters of two developers $d_i$ and $d_j$ by

$$\rho_{ij} = \sqrt{(\alpha_i - \alpha_j)^2 + (\beta_i - \beta_j)^2}$$

as a quantitative metric for the similarity between the W-T patterns of developers, i.e., the shorter the distance, the more similar the W-T patterns of the two developers. Then, we compare the distances between all pairs of developers in the same communities with those between pairs of developers from different communities, and find that the former are significantly shorter (p = 0) than the latter. These qualitative and quantitative analyses support using the HMM parameters as a reasonable proxy for how the interplay of work and talk reflects community culture.

The clustering of W-T patterns gives rise to another interesting question: do developers choose to join communities with W-T patterns similar to their own, or does the similarity emerge over time as developers participate in and evolve with their communities? In the first case, the developers in the same community will present similar W-T patterns from the very beginning, while in the second case, they will become more similar with time. To answer this question, we repeat the above pattern analysis using only the initial 100 activities in the W-T sequences. Based on the comparison, we find that:
1. The developers in the same community showed similar W-T patterns from the time they joined the project. I.e., for their first 100 activities, the distances of HMM parameters between pairs of developers in the same communities are significantly shorter ($p = 3.1 \times 10^{-13}$) than those between pairs from different communities.

Figure 7: Box-and-whisker diagrams of the distances ρ between the HMM parameters of pairs of developers within (inner) and between (inter) communities, computed from the first 100 activities and from the whole W-T sequences.
2. In addition, the community cultures of different communities converge toward, rather than diverge from, each other as time evolves. I.e., both the inner (within-community) and inter (between-community) distances decrease significantly (p = 0) with time, as shown in Figure 7. To study the converging process, we also calculate the average inner distance for all communities by considering only an initial portion of each developer's activities, for different lengths of that portion, as shown in Figure 8. We find that, for most communities, the inner distances decrease as the number of initial activities considered increases. As examples, the evolutions of the HMM parameters with time for the communities Axis2 java, Derby, and Lucene are shown in Figure 9.
3. The clustering of the HMM parameters within communities grows tighter with time. I.e., the convergence rate of the parameter distances from the first 100 activities to all activities within communities (the average distance decreases from 0.3381 to 0.1832) is significantly larger ($p = 1.7 \times 10^{-7}$) than that between communities (the average distance decreases from 0.4216 to 0.2861).
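The inner and inter distance lists compared in the findings above can be built as in the following sketch. The community labels and parameter values are toy data, the function name is hypothetical, and the significance test used in the paper is not reproduced here.

```python
import math
import statistics
from itertools import combinations

def inner_inter_distances(developers):
    """Split pairwise parameter distances into within- and between-community lists.

    developers: list of (community, alpha, beta) tuples.
    """
    inner, inter = [], []
    for (c1, a1, b1), (c2, a2, b2) in combinations(developers, 2):
        rho = math.hypot(a1 - a2, b1 - b2)  # Euclidean distance in the alpha-beta plane
        (inner if c1 == c2 else inter).append(rho)
    return inner, inter

# Toy data: two hypothetical communities with clearly different W-T patterns.
toy_developers = [
    ("A", 0.80, 0.30), ("A", 0.85, 0.35),
    ("B", 0.30, 0.80), ("B", 0.35, 0.75),
]
inner, inter = inner_inter_distances(toy_developers)
mean_inner = statistics.mean(inner)
mean_inter = statistics.mean(inter)
```

Running the same function on parameters fitted to the first 100 activities and to the whole sequences gives the two pairs of distributions that are compared in Figure 7.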
These findings suggest that developers with similar W-T patterns are indeed more likely to join the same communities, and that they continue to harmonize their patterns as they work and talk as a team. In fact, since there are many online communities on similar topics, people can first experience the culture of these communities and then decide whether to join [17,43,44]. For OSS, it is clear that most developers communicate a fair bit on the developer mailing list before actually contributing work [9,52]; indeed, this type of "socialization" is a necessary prerequisite to having one's work contributions accepted. Thus, it is to be expected that developers are more likely to join communities whose work and talk patterns harmonize with their own, in order to reduce coordination efforts.
In addition, we find that different community cultures slightly converge toward, rather than diverge from, each other over time; this suggests that there may be an overarching trend in the W-T patterns of all the developers (in all projects). To investigate this further, we compare the two parameters α and β separately for all developers, considering a) the first 100 activities and b) all activities. We find that both of them increase as time evolves, i.e., the HMMs in case a) have significantly smaller α ($p = 2.7 \times 10^{-2}$) and β ($p = 1.4 \times 10^{-5}$) than those in b). In fact, the efficiency of overall work and talk activities may be measured by the sum α + β; larger values of this sum indicate less switching between activities and thus fewer interruptions. This arguably represents higher efficiency [4,7,36]. In other words, the HMM parameters $(\alpha_i, \beta_i)$ shown in Figure 5 can be fitted by the linear function $\alpha + \beta = \varepsilon$, with a single parameter $\varepsilon$ representing the average efficiency of all the developers. Using the least squares method, we obtain the average efficiency $\varepsilon$ and the corresponding standard deviation $\sigma$ about the regression line for the $N$ developers. We find that the average efficiency steadily increases, while the variance decreases, with time. This means that, as time goes on, developers tend to have longer bursts of pure work and pure talk, suggesting that their discussions are becoming more effective and that the ensuing co-operative work proceeds with fewer interruptions.
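A minimal sketch of the one-parameter fit follows. Minimizing the squared residuals $\alpha_i + \beta_i - \varepsilon$ gives $\varepsilon$ as the mean of $\alpha_i + \beta_i$; taking $\sigma$ as the population standard deviation of those residuals is an assumption, since the exact expression is not recoverable from the text. The function name and toy values are hypothetical.

```python
import statistics

def average_efficiency(params):
    """Least-squares fit of alpha + beta = epsilon to a list of (alpha, beta) pairs.

    Returns (epsilon, sigma), where sigma is the spread of alpha_i + beta_i about
    epsilon (population standard deviation; an assumed definition).
    """
    sums = [alpha + beta for alpha, beta in params]
    epsilon = statistics.mean(sums)
    sigma = statistics.pstdev(sums)
    return epsilon, sigma

# Toy (alpha, beta) pairs for three hypothetical developers.
toy_params = [(0.6, 0.5), (0.7, 0.6), (0.8, 0.7)]
epsilon, sigma = average_efficiency(toy_params)
```

Comparing the values fitted on the first 100 activities with those fitted on all activities would show whether efficiency rises and its spread shrinks over time.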
Looking at the change in the rate of talk activities for all developers, expressed in terms of α and β by equation (16), we find that the rate increases significantly ($p = 4.6 \times 10^{-3}$) with time, indicating that most developers become more socialized in the process. This phenomenon is consistent with the fact that more discussion is needed to further improve a mature product. Meanwhile, contributing to these online communities is social work, i.e., the contributions of developers are highly visible and will be checked by many other users, so it is not surprising that developers need to reply to comments more frequently when they contribute more.

Individual Performance and Community Culture
Regarding community culture, it is interesting to try to identify its benefits for individuals in the respective communities. For example, it is reasonable to hypothesize that developers who work more than they talk will have higher productivity, meaning they will produce more lines of code (LoC), than those seeking a balance between work and talk activities (similarly, those preferring talk over work may have a socialization advantage compared to those seeking balanced activities). We ask: does increasing productivity always come at the price of decreasing socialization, and vice versa? Note that a strong work or talk preference, i.e., a larger α or β, respectively, does not necessarily lead to higher productivity or socialization, because our sequence analysis does not take the activity time into consideration and is also independent of the length of the sequences.
To answer this question, we study how community culture correlates with five measures of individual performance: work rhythm (number of work activities per day), thousands of lines of code added per unit time (KLoC), talk rhythm (number of talk activities per day), newly established social links per week, and observed survival time, denoted by $X_1$ to $X_5$, respectively, as summarized in Table 2. The first four properties are calculated over the time period of the person's W-T sequence. The survival time, $X_5$, of a developer is defined as the period of time from their first activity to their last one, which may be longer than the period of their W-T sequence, considering that the W-T sequences under study were pre-processed by removing prefixes of pure work or talk activities. The survival time of a developer is only observed when the developer has left the respective community. Here, as a reasonable estimate, we consider that a developer has left the community if they have not been active for a relatively long time, i.e., longer than some threshold $T$.
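The survival-time rule can be sketched as below. The function name is hypothetical, inactivity is measured up to the data collection date (an assumption), and a year is taken as 365.25 days for simplicity.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # simplifying assumption

def observed_survival_years(first_activity, last_activity, collection_date, threshold_years):
    """Survival time in years, or None if the developer still counts as active.

    A developer is treated as having left when the time between their last
    activity and the collection date exceeds the threshold T.
    """
    inactive_years = (collection_date - last_activity).days / DAYS_PER_YEAR
    if inactive_years <= threshold_years:
        return None  # survival time is not observed (developer not yet gone)
    return (last_activity - first_activity).days / DAYS_PER_YEAR

collection = date(2012, 3, 24)
left = observed_survival_years(date(2006, 1, 1), date(2010, 1, 1), collection, 1.0)
still_active = observed_survival_years(date(2006, 1, 1), date(2012, 1, 1), collection, 1.0)
```

Developers for whom the function returns `None` are the ones whose survival time is not observed under the chosen threshold.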
All developers are divided into three clusters by their HMM parameters, as shown in Figure 5.

Table 2: Properties of individual performance and the significance of their comparisons among the three clusters of developers.

| Property | Description | Significance |
| $X_1$ | Working rhythm: the number of work activities per day | $1.9 \times 10^{-2}$, 5.4 |
| $X_2$ | The kilo lines of added code (KLoC) per day | $1.5 \times 10^{-2}$, 7.8 |
| $X_3$ | Talking rhythm: the number of talk activities per day | $1.2 \times 10^{-2}$, $5.8 \times 10^{-2}$, $6.6 \times 10^{-6}$ |
| $X_4$ | The number of new social links per week | $9.0 \times 10^{-1}$, $3.7 \times 10^{-2}$, $2.0 \times 10^{-3}$ |
| $X_5$ | The observed survival time (year) | $3.0 \times 10^{-1}$, $3.7 \times 10^{-2}$, $2.7 \times 10^{-1}$ |

Developers in Cluster #1 emphasize "talk", those in Cluster #3 emphasize "work", while those in Cluster #2 seek a balance between the two. For each property from $X_1$ to $X_4$, we have a list of values for the developers in each cluster, and the comparisons between the properties of developers in different clusters are visualized by the box-and-whisker diagrams shown in Figure 10 (left), with the significance presented in Table 2. We find that the developers in Cluster #3 have the fastest working rhythms, those in Cluster #2 follow, and the developers in Cluster #1 work the slowest. The direction reverses for their talking rhythms. However, the situation is a little different when we compare the abilities of developers in different clusters to produce code and earn social status. We find that the developers in Cluster #2 and Cluster #3 produce similar KLoC per day, and both groups produce significantly more than the developers in Cluster #1, while the developers in Cluster #2 and Cluster #1 earn similar numbers of social links per week, and both groups earn significantly more than the developers in Cluster #3. These results indicate that extended discussion is always accompanied by a slowing of work rhythms, but not always by a decrease in productivity, and that developers seeking a balance between work and talk are competitive, in both productivity and socialization, with those who mostly work or mostly talk.

Although it seems that the developers who mostly work have the fastest working rhythms and the highest productivity, on average, this does not mean that this choice is the healthiest for them or for the overall project, since these developers are more likely to become bored and then quit the projects. To analyze the survival times of developers (time from joining until leaving) in terms of the HMM parameters α and β, we use the Hazard model [22], in which the Hazard ratio is defined in terms of a covariate x, with coefficient b, where x is either α or β. (See the Appendix for further mathematical descriptions of the Hazard model.) We find that developers with smaller α or larger β have suggestively longer survival times (p = 0.077 and b = 1.7 for α, and p = 0.042 and b = −2.4 for β), indicating that, by comparison, talk activities are more important than work activities for developer retention. Indeed, we find that developers with more balance between their work and talk stay active in the projects for suggestively longer periods of time than those who mostly work, as shown in Figure 10 (right), i.e., the significance is equal to 0.037, 0.078, and 0.049 when we consider the survival times of the developers whose last activities occurred more than half a year, one year, and two years earlier, respectively. The significance of the comparison of survival times among the three clusters of developers is presented in Table 2 for T = 1 (year). These findings suggest that developers with balanced W-T patterns are important for sustaining OSS communities. Each of the communities we studied has at least one balanced developer, and there is also a natural trend for developers to become more balanced, i.e., both α and β increase with time.
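For orientation, a commonly used proportional hazards form with a single covariate is written out below. Whether this is exactly the form given in the Appendix is an assumption; the symbols $h(t \mid x)$ and $h_0(t)$ are introduced here for illustration, while $b$ and $x$ follow the text.

```latex
h(t \mid x) = h_0(t)\, e^{b x},
\qquad
\frac{h(t \mid x)}{h_0(t)} = e^{b x},
\qquad x \in \{\alpha, \beta\}
```

Under this form, a positive coefficient $b$ means that a larger $x$ raises the hazard of leaving and so shortens survival, while a negative $b$ lengthens it. This reading agrees with the reported signs: $b = 1.7$ for α (smaller α, longer survival) and $b = -2.4$ for β (larger β, longer survival).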

Synchronizing Role of Social and Technical Links
It is well understood that individuals can influence each other's behaviors through social links [11,15,24]. Here, we study the extent to which developers with similar W-T patterns tend to be more strongly linked in the email network or in the technical cooperation network. In the social network, the social weight between two developers is the number of emails exchanged between them. In the cooperation network, a pair of developers is linked by an edge whose weight is the number of files on which they have both worked. In particular, denoting by $\psi_i$ the list of files that developer $d_i$ commits to, the cooperative weight between a pair of developers $d_i$ and $d_j$ is defined as $|\psi_i \cap \psi_j|$, i.e., the number of files to which both have committed.

On the social end, for pairs of developers, we ask: are the distances between their HMM parameters correlated with the number of emails they have exchanged? The results of using both Pearson and Spearman correlations are given in Table 3, under the Social weights columns. We find negative correlation in ten out of fourteen communities with both methods, with significance p < 0.05 in eight of them, while we find positive correlation with significance p < 0.1 in only one community, Mahout. A negative correlation means that the smaller the HMM parameter distance between two developers, the larger the number of emails they have exchanged.
On the technical end, we study the correlation between the distances of HMM parameters and the strength of file cooperation links between developers. Using the same correlation measures as before, we get the results in Table 3, under the Cooperative weights columns. In this case, negative correlation is found in eleven out of fourteen communities with both methods, with significance p < 0.1 in four of them (Activemq, Ant, Derby, and Solr), while no community has positive correlation with significance p < 0.1. The negative correlation means that the smaller the HMM parameter distance between two developers, the stronger the cooperation between them.
When considering all communities together, we obtain a significantly negative correlation for both methods in both cases (the last row of Table 3). Thus, developers with more emails between them or committing to more of the same files are more likely to have similar W-T patterns. The results also indicate that community culture may be either socially oriented or task oriented (technical); the distances between HMM parameters are more likely to be correlated with social weights in some communities, and with cooperative weights in others.
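The pairwise comparison described in this section can be sketched as follows, using only the standard library. The Euclidean distance in the α-β plane is an assumption (this section does not state which distance is used), the developer parameters and email counts are toy values, and all function names are hypothetical. The sketch computes the correlation coefficients only, not their significance.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def hmm_distance(p, q):
    """Assumed Euclidean distance between two (alpha, beta) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Toy (alpha, beta) estimates and email counts, for illustration only.
params = {"a": (0.80, 0.60), "b": (0.78, 0.62),
          "c": (0.60, 0.75), "d": (0.55, 0.80)}
emails = {("a", "b"): 40, ("a", "c"): 6, ("a", "d"): 2,
          ("b", "c"): 8, ("b", "d"): 3, ("c", "d"): 25}

pairs = list(combinations(sorted(params), 2))
distances = [hmm_distance(params[u], params[v]) for u, v in pairs]
social = [emails[pair] for pair in pairs]

r_pearson = pearson(distances, social)
r_spearman = spearman(distances, social)
print(f"Pearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}")
```

In this toy data the pairs that are closest in the α-β plane exchange the most emails, so both coefficients are negative, which is the sign reported for most communities in Table 3.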

Threats to Validity
Although most results shown in this paper are significant, there are still a number of threats to the validity of this study.
All the OSS projects under study are collected from Apache, which may limit the generalizability of the results. The methods therefore need to be tested in the future on other OSS ecosystems, such as GitHub [19] and GNOME [46], or on other online volunteer communities, such as Wikipedia [34,51]. We have collected some GitHub data, and also find the converging W-T patterns there. However, the correlation between W-T patterns and the productivity of developers needs to be further validated, since we have not yet collected data on the amount of code added or deleted in a particular commit. We will show the extended results in our future work.
We acknowledge that, like many other empirical studies, our work is based on a sample of the work and talk activities of developers, not all of them. We consider only the commits to code or documents as work activities, while in reality developers may have other kinds of work activities that are relatively difficult to capture from OSS repositories. Besides, there might be other kinds of talk activities too, e.g., discussions on issue tracking systems, that are not included in this study. In fact, we did collect issue tracking data from Jira and Bugzilla [25], and ran experiments that included them as talk activities (both opening an issue, which initiates a discussion, and comments). However, we did not find any result that changes dramatically in this case, indicating that the revealed phenomena are quite robust.
We use only LoC to measure the productivity of a developer, while in fact there are alternative metrics, such as the number of issues fixed [60]. Moreover, there are also metrics of work efficiency, such as the development time of tasks [29], and of code quality, such as the number of bugs [3]. We will study the effects of work-talk patterns on these metrics in the future, rather than put them together in this single paper. It is reasonable to use LoC here, since it has been used extensively to measure the productivity of developers [39].

Conclusions
In this paper, we demonstrate that work-talk patterns of software developers in a number of OSS communities can be effectively studied using sequence analysis methods on sequences arising from simple two-state behavior models. Our methods enabled us to learn about a series of interesting phenomena in task-oriented communities: that developers in a community present similar W-T patterns, and this clustering of W-T patterns is enhanced with time, reflecting different work cultures in these communities, with emphasis on different proportions of continuous work to continuous talk activities; that social and technical interactions may play a role in synchronizing W-T patterns, since developers with stronger social or technical links in a community have more similar W-T patterns; and that although successful task communities may have relatively different cultures, developers with balanced work-talk patterns seem to play critical roles in sustaining them, and, at least among the ones we studied, each has at least one such developer.
These findings suggest that online individuals may synchronize their behaviors with others to better fit in the task communities and to improve coordination efficiency. The methods proposed in this paper can be further expanded and applied to analyze the switching patterns of more varied kinds of activities in more diverse online communities.

Hidden Markov Model
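An HMM is a simple modeling mechanism to explain transitions among several different states. We use an HMM with two states, "work" and "talk", and transitions between them corresponding to either continuing to perform the same activity (W followed by W, or T followed by T) or switching activities (W followed by T, and vice versa). The HMM diagram is shown in Figure 2. If we denote by $P_W(k)$ and $P_T(k)$ the probabilities that work and talk, respectively, happen at time step $k$, then for the next time step we have

$$P_W(k+1) = \alpha P_W(k) + (1-\beta) P_T(k), \tag{10}$$

$$P_T(k+1) = (1-\alpha) P_W(k) + \beta P_T(k), \tag{11}$$

where α and β are the transition probabilities of remaining in the work and talk states, respectively. We note here that while α and β could evolve with time, they do not change much between successive activities; we can therefore treat them as constants in sequences of limited length. Equations (10) and (11) can be approximated for continuous time, τ, and then transformed to the following more compact matrix form:

$$\frac{dP(\tau)}{d\tau} = \begin{bmatrix} \alpha - 1 & 1 - \beta \\ 1 - \alpha & \beta - 1 \end{bmatrix} P(\tau), \tag{12}$$

with $P(\tau) = [P_W(\tau), P_T(\tau)]^T$. By solving equation (12), we have

$$P(\tau) = e^{-(2-\alpha-\beta)\tau} \left( P(0) - \frac{1}{2-\alpha-\beta} \begin{bmatrix} 1-\beta \\ 1-\alpha \end{bmatrix} \right) + \frac{1}{2-\alpha-\beta} \begin{bmatrix} 1-\beta \\ 1-\alpha \end{bmatrix}. \tag{13}$$

The fractions of work and talk activities, $P_W$ and $P_T$, in a sequence of length $L$ can be estimated by

$$P_W = \frac{1}{L} \int_0^L P_W(\tau)\, d\tau, \qquad P_T = \frac{1}{L} \int_0^L P_T(\tau)\, d\tau. \tag{14}$$

By substituting equation (13) into (14), we have

$$P_W = \frac{1 - e^{-(2-\alpha-\beta)L}}{(2-\alpha-\beta)L} \left( P_W(0) - \frac{1-\beta}{2-\alpha-\beta} \right) + \frac{1-\beta}{2-\alpha-\beta}. \tag{15}$$

On the right side of equation (15), the first term is negligible when the sequence is long enough, considering α + β < 2. Since $P_W + P_T = 1$ always holds, we have

$$P_W = \frac{1-\beta}{2-\alpha-\beta}, \qquad P_T = \frac{1-\alpha}{2-\alpha-\beta}, \tag{16}$$

which are fully determined by the two parameters of the model. Then, the probabilities of the four different two-patterns in the sequence, in terms of α and β, are given by:

$$P_{WW} = \alpha P_W = \frac{\alpha(1-\beta)}{2-\alpha-\beta}, \tag{17}$$

$$P_{WT} = (1-\alpha) P_W = \frac{(1-\alpha)(1-\beta)}{2-\alpha-\beta}, \tag{18}$$

$$P_{TW} = (1-\beta) P_T = \frac{(1-\alpha)(1-\beta)}{2-\alpha-\beta}, \tag{19}$$

$$P_{TT} = \beta P_T = \frac{\beta(1-\alpha)}{2-\alpha-\beta}. \tag{20}$$

Intuitively, larger α and β mean higher proportions of WW and TT patterns, respectively, in the sequence. Furthermore, the probabilities of longer patterns can be calculated similarly, once the model parameters α and β are estimated from equations (17) to (20). It is important to note that we always have α + β = 1 in the randomized W-T sequences generated by the null model. In this case, α and β are equal to the fractions of work and talk activities, respectively.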
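The following minimal sketch shows one way to recover α and β from an observed W-T sequence and to evaluate the resulting two-pattern probabilities. It assumes that α and β are the probabilities of staying in the work and talk states, estimates them from transition counts (equivalent to inverting the two-pattern frequencies), and uses the long-sequence fractions $P_W = (1-\beta)/(2-\alpha-\beta)$ and $P_T = (1-\alpha)/(2-\alpha-\beta)$. The function names and the example sequence are hypothetical.

```python
from collections import Counter

def estimate_hmm(seq):
    """Estimate (alpha, beta) from a string of 'W' and 'T' symbols.

    alpha = share of W steps followed by W; beta = share of T steps
    followed by T. Assumes both W and T occur before the last position.
    """
    pairs = Counter(zip(seq, seq[1:]))
    ww, wt = pairs[("W", "W")], pairs[("W", "T")]
    tt, tw = pairs[("T", "T")], pairs[("T", "W")]
    return ww / (ww + wt), tt / (tt + tw)

def stationary_fractions(alpha, beta):
    """Long-sequence fractions of work and talk (requires alpha + beta < 2)."""
    denom = 2 - alpha - beta
    return (1 - beta) / denom, (1 - alpha) / denom

def two_pattern_probabilities(alpha, beta):
    """Probabilities of the four two-patterns WW, WT, TW, and TT."""
    p_w, p_t = stationary_fractions(alpha, beta)
    return {
        "WW": alpha * p_w,
        "WT": (1 - alpha) * p_w,
        "TW": (1 - beta) * p_t,
        "TT": beta * p_t,
    }

# Hypothetical activity sequence of one developer.
sequence = "WWWTTWWWWTTTWW"
alpha, beta = estimate_hmm(sequence)
p_w, p_t = stationary_fractions(alpha, beta)
patterns = two_pattern_probabilities(alpha, beta)

print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
print(f"P_W = {p_w:.3f}, P_T = {p_t:.3f}")
print({name: round(prob, 3) for name, prob in patterns.items()})
```

For a randomized sequence with α + β = 1, `stationary_fractions` returns (α, β) itself, consistent with the null-model remark above.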

Hazard Model
Survival analysis enables modeling of outcomes in the presence of censored data. In our case, the censoring arises because a long period without activity may or may not indicate that a developer has left the community. Generally, survival analysis involves calculating the Hazard rate, defined as the limit of the number of events per δt time divided by the number at risk, as δt → 0. Here, supposing that a developer does not leave the community until time Γ, the Hazard rate is given by

$$h(t) = \lim_{\delta t \to 0} \frac{P(t \le \Gamma < t + \delta t \mid \Gamma \ge t)}{\delta t}. \tag{21}$$

Our primary interest is the survival function, defined as $S(t) = P(t < \Gamma)$, which can be calculated from equation (21) by

$$S(t) = \exp\left( -\int_0^t h(u)\, du \right). \tag{22}$$

Suppose there are several factors, denoted by $x_i$, $i = 1, 2, \ldots, \gamma$, that can influence the survival time. We then adopt the Cox model [18] to define the Hazard rate $h(t)$ by

$$h(t) = h_0(t) \exp\left( \sum_{i=1}^{\gamma} b_i x_i \right), \tag{23}$$

with $h_0(t)$ describing how the hazard changes over time at baseline levels of the covariates. Here we focus on the hazard ratio $h(t)/h_0(t)$ to see whether increasing some covariates will significantly increase or decrease the survival time; e.g., $b_i > 0$ means that individuals with larger $x_i$ will have statistically shorter survival times.
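As an illustration of how a covariate shifts the survival function, the sketch below evaluates S(t) under a Cox-type hazard. It makes one simplifying assumption that the Cox model itself does not require: the baseline hazard $h_0$ is constant in time, so that $S(t) = \exp(-h t)$ in closed form. The baseline value, the two α values, and the function names are hypothetical; only the coefficient 1.7 comes from the text.

```python
import math

def cox_hazard(baseline_hazard, coefficients, covariates):
    """Hazard h = h0 * exp(sum_i b_i * x_i) for one individual."""
    linear = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear)

def survival_constant_hazard(hazard, t):
    """S(t) = exp(-hazard * t), valid only for a time-constant hazard."""
    return math.exp(-hazard * t)

H0 = 0.5            # hypothetical constant baseline hazard (per year)
B = (1.7,)          # coefficient reported in the text for alpha

# Two hypothetical developers who differ only in alpha.
h_low = cox_hazard(H0, B, (0.6,))
h_high = cox_hazard(H0, B, (0.8,))

s_low = survival_constant_hazard(h_low, t=1.0)
s_high = survival_constant_hazard(h_high, t=1.0)

print(f"alpha = 0.6: hazard {h_low:.3f}, S(1 year) = {s_low:.3f}")
print(f"alpha = 0.8: hazard {h_high:.3f}, S(1 year) = {s_high:.3f}")
```

With a positive coefficient, the developer with the larger α has the higher hazard and therefore the lower probability of still being active after one year.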

Figure 3: The box-and-whisker diagram for the preferences of the four different two-patterns in the real W-T sequences under the different time-interval conditions by comparing with the random ones.

Figure 4: The box-and-whisker diagram for the relative errors of the eight different three-patterns introduced by the random mechanism and the HMMs, comparing with the real ones.

Figure 5: Visualization of developers on the α-β plane by considering their whole sequences, where developers are points and those of the same communities are marked by the same symbols. The parameters are grouped into three clusters by the "K-means" method. The base line is formed by the HMM parameters of the random W-T sequences with different fractions of work activities. The points are fitted by the linear function α + β = ε, with ε = 1.38.

Figure 6: The centers and the respective diversities (the large circles) of the eleven communities on the α-β plane, defined as the medians of the HMM parameters of their developers and the average distances of HMM parameters between the developers and the corresponding centers, respectively.

Figure 8: The average inner distances of HMM parameters between pairwise developers for the fourteen communities.

Figure 10: The effects of community culture on individual properties. The box-and-whisker diagrams for (left) the four individual properties X 1 to X 4 , and (right) the observed survival time X 5 with different time thresholds T 1 (half year), T 2 (one year), and T 3 (two years), for the developers in the three clusters determined by their HMM parameters.

Table 1: Basic properties of the fourteen OSS communities.

Table 2: Student's t-tests for five individual properties between different clusters.

Table 3: The Pearson and Spearman correlation tests for the distances of HMM parameters and the social and cooperative weights between pairwise developers for different communities.