Crowdsourcing Dialect Characterization through Twitter

doi:10.1371/journal.pone.0112074

Figure 1.

Spanish tweet locations.

The overwhelming majority of Spanish tweets are located in Spain and Spanish America, but significant contributions also come from certain US states and from major Western European and Brazilian cities.

Figure 2.

Geographical distribution of the dominant word for the concepts ‘computer’ (left) and ‘car’ (right).

Map locations are colored according to the most common expression found in the corresponding cell. The area of the circle is proportional to the number of tweets.
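
The coloring rule described in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `dominant_word_per_cell`, the `(lat, lon, word)` input layout, and the default `cell_size` of 0.25 degrees are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def dominant_word_per_cell(tweets, cell_size=0.25):
    """Return {cell: (dominant_word, tweet_count)} for a lat/lon grid.

    tweets: iterable of (lat, lon, word) tuples (hypothetical layout).
    cell_size: grid spacing in degrees (assumed value, not from the source).
    """
    counts = defaultdict(Counter)
    for lat, lon, word in tweets:
        # Floor division assigns each tweet to the grid cell containing it.
        cell = (int(lat // cell_size), int(lon // cell_size))
        counts[cell][word] += 1

    result = {}
    for cell, counter in counts.items():
        # The most common expression decides the cell's color.
        word, _ = counter.most_common(1)[0]
        # The total tweet count would set the circle's area on the map.
        result[cell] = (word, sum(counter.values()))
    return result
```

Since the circle area is proportional to the number of tweets, a plotting routine built on this would scale the circle radius with the square root of the count.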

Figure 3.

Cumulative variance explained as a function of the number of components.

With components (vertical blue line) we are able to maintain over of the variance present in the data while significantly reducing the matrix size.
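
A curve of this kind can be computed from the per-component variances alone. The sketch below is a hypothetical illustration: the helper names, the input list of variances, and any threshold passed in are assumptions, and it does not reproduce the component count or variance fraction used in the figure.

```python
from itertools import accumulate


def cumulative_variance_fraction(component_variances):
    """Cumulative fraction of variance explained by the first k components.

    component_variances: positive variances (eigenvalues), one per component.
    """
    # Principal components are conventionally ordered by decreasing variance.
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def components_needed(component_variances, threshold):
    """Smallest number of components whose cumulative fraction >= threshold."""
    fractions = cumulative_variance_fraction(component_variances)
    for k, fraction in enumerate(fractions, start=1):
        if fraction >= threshold:
            return k
    # Rounding can leave the last fraction just below 1; fall back to all.
    return len(fractions)
```

Truncating at the returned component count is what reduces the matrix size while keeping the chosen share of the variance.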

Figure 4.

Characterization of the two superdialects.

A) and silhouette statistics as a function of . B) Geographical representation of the two clusters, (red) and (blue). For visualization purposes we increased the size of each cell. The names of the main cities corresponding to superdialect are shown for clarity. C) Population distribution of the cells corresponding to each cluster.

Figure 5.

Characterization of major cluster β.

Geographical representation of regional dialects. For visualization purposes we increased the size of each cell. Three well-separated regions are indicated with dashed lines, along with the top three dominant words characteristic of each region.
