Has Large-Scale Named-Entity Network Analysis Been Resting on a Flawed Assumption?

doi:10.1371/journal.pone.0070299

Figure 1.

Illustration of splitting 1 author into 2 authors based on a name variant alone.

The bold arrow separating the 2 network diagrams indicates the direction of change: before, to the left; after, to the right. Hjortaas M[J] is split into Hjortaas M and Hjortaas MJ when identity is based on last name and both initials. Note that the split would not have occurred if last name and first initial had been the criterion. Note also that the artificial vertices created by the split do not separate completely, in the sense that Hjortaas M and Hjortaas MJ continue to share some co-authors. These are real data from PubMed, but the network measures refer only to the local network shown.
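
The dependence of the split on the matching criterion can be sketched in a few lines of Python. This is only an illustration: the function `name_key` and its `both_initials` parameter are hypothetical names introduced here, and the sketch assumes, as the figure does, that the two variants denote one person.

```python
def name_key(last, initials, both_initials=True):
    """Build the identity key used under the name = identity assumption.

    `both_initials=True` keys on last name plus all initials;
    `both_initials=False` keys on last name plus first initial only.
    """
    initials = initials.upper()
    return (last.lower(), initials if both_initials else initials[:1])

# The two name variants from the figure, assumed to denote one person.
variants = [("Hjortaas", "M"), ("Hjortaas", "MJ")]

# Distinct keys correspond to distinct vertices in the network.
both = {name_key(last, ini, both_initials=True) for last, ini in variants}
first = {name_key(last, ini, both_initials=False) for last, ini in variants}

print(len(both))   # vertices under last name, both initials
print(len(first))  # vertices under last name, first initial
```

Under the both-initials key the two variants map to different vertices (a split), whereas under the first-initial key they collapse to one.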

Figure 2.

Illustration of lumping 2 authors using only last name, both initials.

The bold arrow separating the 2 network diagrams indicates the direction of change: before, to the left; after, to the right. Note that Jon A. Kenniston and Julia A. Kenniston had no common co-authors before lumping. Lumping introduces a cutpoint as 2 connected components become biconnected. These are real data from PubMed, but the network measures refer only to the local network shown.
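
A toy version of this effect can be reproduced with a brute-force cutpoint check. The sketch below is not the real PubMed neighborhood: the co-author labels `P`, `Q`, `R`, and `S` are hypothetical placeholders, and the helper names (`build`, `components`, `cutpoints`) are introduced here for illustration only.

```python
def build(edges, relabel=None):
    """Build an undirected adjacency map, optionally relabeling vertices."""
    relabel = relabel or {}
    adj = {}
    for u, v in edges:
        u, v = relabel.get(u, u), relabel.get(v, v)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def components(adj, skip=None):
    """Return the connected components, ignoring the vertex `skip`."""
    seen, comps = set(), []
    for start in adj:
        if start == skip or start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w != skip and w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def cutpoints(adj):
    """A vertex is a cutpoint if removing it increases the component count."""
    base = len(components(adj))
    return {v for v in adj if len(components(adj, skip=v)) > base}

# Hypothetical co-authors P, Q (of Jon) and R, S (of Julia).
edges = [
    ("Jon A. Kenniston", "P"), ("Jon A. Kenniston", "Q"), ("P", "Q"),
    ("Julia A. Kenniston", "R"), ("Julia A. Kenniston", "S"), ("R", "S"),
]
before = build(edges)
# Last name, both initials maps both authors to the same key.
lump = {"Jon A. Kenniston": "Kenniston JA", "Julia A. Kenniston": "Kenniston JA"}
after = build(edges, relabel=lump)

print(len(components(before)), cutpoints(before))
print(len(components(after)), cutpoints(after))
```

The brute-force check is quadratic and only suitable for small illustrations; a linear-time depth-first search would be the usual choice at the scale of the full networks.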

Table 1.

Extent of distortion caused by name = identity assumptions.

Table 2.

The effect of splitting and lumping on precision and recall for PubMed (2003–2007) and USPTO (2003–2007) with respect to disambiguated networks.

Table 3.

The number of operations required for each simulation in Figures 3, 4, 6, and 7, which corresponds to the number of name instances eligible for identity change.

Figure 3.

Change in clustering coefficient and degree assortativity given splitting and lumping of PubMed authors (2003–2007) and USPTO inventors (2003–2007).

In each subfigure, the x axis denotes the state of completion for splitting and lumping separately; the y axis represents the value of each labeled statistic. Each line segment (differentiated by color and style) plots 100 separate snapshots of the underlying network taken at even intervals for each set of operations. Splitting is based on last name, both initials. See Table 3 for the number of operations required. The global clustering coefficient is computed according to Equation 1; the mean local clustering coefficient, according to Equation 2. Degree assortativity is calculated as the correlation coefficient (corr coeff) with linear scaling and, separately, log-based scaling of degree.
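
As a rough sketch of the assortativity measure described here, the Python below computes the Pearson correlation between the degrees at the two ends of each edge, once with linear and once with logarithmic scaling. It assumes the common convention of counting each undirected edge in both directions; the function names and the `scale` parameter are illustrative choices, not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient (undefined if either series is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def degree_assortativity(edges, scale=lambda d: d):
    """Correlate (scaled) degrees across edge endpoints."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    xs, ys = [], []
    for u, v in edges:
        # Count each undirected edge in both directions so the measure is symmetric.
        xs += [scale(deg[u]), scale(deg[v])]
        ys += [scale(deg[v]), scale(deg[u])]
    return pearson(xs, ys)

# A star: one hub linked to three leaves, a maximally disassortative toy graph.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
linear = degree_assortativity(star)
logged = degree_assortativity(star, scale=math.log)
print(linear, logged)
```

On heavy-tailed degree distributions the linear and log-scaled coefficients can differ considerably, since the linear version is dominated by the highest-degree vertices.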

Figure 4.

Change in triangles and connected triples given splitting and lumping of PubMed authors (2003–2007) and USPTO inventors (2003–2007).

In each subfigure, the x axis denotes the state of completion for splitting and lumping separately; the y axis represents the value of each labeled statistic. Each line segment (differentiated by color and style) plots 100 separate snapshots of the underlying network taken at even intervals for each set of operations. Splitting is based on last name, both initials. See Table 3 for the number of operations required.
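
For readers unfamiliar with these two counts, the following Python sketch tallies triangles and connected triples on a toy graph. It also forms the ratio 3 × triangles / triples, which is the standard definition of the global clustering coefficient; whether that matches the paper's Equation 1 exactly is an assumption, since the equation is not reproduced here.

```python
from itertools import combinations

def triangles_and_triples(adj):
    """Count triangles and connected triples in an undirected graph.

    A connected triple is a path of length 2 centered on some vertex;
    each triangle closes three such triples.
    """
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    closed = sum(
        1
        for nb in adj.values()
        for a, b in combinations(nb, 2)
        if b in adj[a]
    )
    return closed // 3, triples

# Toy graph: a triangle a-b-c with a pendant vertex d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
triangles, triples = triangles_and_triples(adj)
global_cc = 3 * triangles / triples  # standard transitivity, assumed here
print(triangles, triples, global_cc)
```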

Figure 5.

Degree distribution and its relationship with the local clustering coefficient and degree assortativity.

Each point represents the average of a set of authors (inventors) with identical degree. The points near the dashed diagonal reflect the influence of hyper-authorship.

Table 4.

Comparison of different ways of measuring the clustering coefficient and degree assortativity.

Figure 6.

Change in density, the proportion of cutpoints, and average shortest path given splitting and lumping of PubMed authors (2003–2007) and USPTO inventors (2003–2007).

In each subfigure, the x axis denotes the state of completion for splitting and lumping separately; the y axis represents the value of each labeled statistic. Each line segment (differentiated by color and style) plots 100 separate snapshots of the underlying network taken at even intervals for each set of operations. Splitting is based on last name, both initials. See Table 3 for the number of operations required.
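
Two of the statistics named in this caption, density and average shortest path, can be sketched in Python as follows. The caption does not say how unreachable pairs are handled, so this sketch assumes the average is taken over reachable ordered pairs only; the function names are illustrative.

```python
from collections import deque

def density(adj):
    """Fraction of possible undirected edges that are present: 2m / (n(n-1))."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    return 2 * m / (n * (n - 1))

def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def average_shortest_path(adj):
    """Mean distance over reachable ordered pairs (an assumption of this sketch)."""
    total = pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

# Toy path graph a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(density(path), average_shortest_path(path))
```

Running a breadth-first search from every vertex is quadratic in the worst case, so at the scale of the full networks some form of sampling or approximation would likely be needed.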

Figure 7.

Change in measures of components given splitting and lumping of PubMed authors (2003–2007) and USPTO inventors (2003–2007).

In each subfigure, the x axis denotes the state of completion for splitting and lumping separately; the y axis represents the value of each labeled statistic. Each line segment (differentiated by color and style) plots 100 separate snapshots of the underlying network taken at even intervals for each set of operations. Splitting is based on last name, both initials. Differences in the mean size of biconnected components between PubMed and USPTO suggest a cause of the unexpected behavior of cutpoints in Figure 6. See Table 3 for the number of operations required.

Figure 8.

Cumulative distributions of collaborator counts (degree) for PubMed (2003–2007) and USPTO (2003–2007).

Note that in both cases, the disambiguated data exhibit much more curvature than the data under the name = identity assumption.
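
A cumulative degree distribution of this kind can be computed from a list of collaborator counts as sketched below in Python. The complementary form P(K ≥ k) is assumed here because it is the usual choice for heavy-tailed data; the caption itself does not specify which form is plotted, and the sample degrees are made up.

```python
from collections import Counter

def ccdf(degrees):
    """Return sorted (k, P(K >= k)) pairs for a list of degrees."""
    n = len(degrees)
    counts = Counter(degrees)
    out, remaining = [], n
    for k in sorted(counts):
        # `remaining` is the number of vertices with degree at least k.
        out.append((k, remaining / n))
        remaining -= counts[k]
    return out

# Made-up collaborator counts for four authors.
sample = [1, 1, 2, 4]
print(ccdf(sample))
```

Plotted on log-log axes, a pure power law would appear as a straight line, so curvature of the kind the caption notes indicates departure from that form.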

Figure 9.

Distributions of collaborator counts (degree) conditioned on paper and patent counts for PubMed (2003–2007) and USPTO (2003–2007).

Papers and patents with 20 or more authors or inventors are excluded. Lumping error is visible in the upper row of plots as the name = identity assumption inflates collaborator counts. For PubMed, 280,446 (9%) authors have 4 co-authors over the period; for USPTO, 93,540 (17%) inventors have no co-inventors. For authors with 1 paper, 3 co-authors is the mode; for authors with over 10 papers, 33 co-authors.

Table 5.

Basic properties of the PubMed (2003–2007) and USPTO (2003–2007) networks along with power-law fits of their degree distributions.
