
Studies on Arabic Dialectology and Sociolinguistics

In this study we examine linguistic variation and its dependence on both social and geographic factors. We follow dialectometry in applying a quantitative methodology and focusing on dialect distances, and social dialectology in the choice of factors we examine in building a model to predict word pronunciation distances from the standard Dutch language to Dutch dialects. We combine linear mixed-effects regression modeling with generalized additive modeling to predict the pronunciation distance of words.

Although geographical position is the dominant predictor, several other factors emerged as significant. The model predicts a greater distance from the standard for smaller communities, for communities with a higher average age, for nouns (as contrasted with verbs and adjectives), for more frequent words, and for words with relatively many vowels.

The impact of the demographic variables, however, varied from word to word. For a majority of words, larger, richer and younger communities are moving towards the standard.

For a smaller minority of words, larger, richer and younger communities emerge as driving a change away from the standard. Similarly, the strength of the effects of word frequency and word category varied geographically. The peripheral areas of the Netherlands showed a greater distance from the standard for nouns as opposed to verbs and adjectives as well as for high-frequency words, compared to the more central areas.

Our findings indicate that changes in pronunciation have been spreading in particular for low-frequency words from the Hollandic center of economic power to the peripheral areas of the country, meeting resistance that is stronger wherever, for well-documented historical reasons, the political influence of Holland was reduced. Our results are also consistent with the theory of lexical diffusion, in that distances from the Hollandic norm vary systematically and predictably on a word by word basis.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Competing interests: The authors have declared that no competing interests exist.

In this study we integrate the approaches of two fields addressing linguistic variation, dialectometry and social dialectology.

Dialectology is the older discipline, where researchers focus on a single linguistic feature or a small set of features in their analysis. Initially the focus in this field was on dialect geography [1, Ch. Later, dialectologists increasingly recognized the importance of social variation. The work of Labov and later Trudgill has been very influential in this regard [2], [3]. Social dialectologists have often examined both social and linguistic influences on individual linguistic features, generally using logistic regression designs [4], but more recently also using mixed-effects regression modeling [5].

Since then other researchers, including Goebl, Heeringa and Nerbonne, and Kretzschmar, have refined the computational and quantitative techniques to measure and interpret these aggregate dialect distances [8]–[10]. We follow dialectometry in viewing linguistic distance for hundreds of individual words as our primary dependent variable.

There are, of course, some exceptions in which, for example, the diachronic perspective is taken into account [12], [13], or age and gender are considered as covariates [14], but to our knowledge no dialectometric study has attempted to model the effects of multiple geographic and social variables simultaneously. Dialectometry has also been criticized for focusing too much on the aggregate level of linguistic differences [15], [16], thereby neglecting the level of linguistic structure where individual words and linguistic properties are important.

Acknowledging honorable exceptions [11], we concede that the focus in dialectometry has been on aggregate levels, but the strength of the present analysis is that it focuses on individual words in addition to aggregate distances predicted by geography. This quantitative social dialectological study is the first to investigate the effect of a range of social and lexical factors on a large set of dialect distances.

In the following we will focus on building a model to explain the pronunciation distance between dialectal pronunciations in different locations and standard Dutch for a large set of distinct words. Of course, choosing standard Dutch as the reference pronunciation is not historically motivated, as standard Dutch is not the proto-language.

However, the standard language remains an important reference point for two reasons. First, as noted by Kloeke, in the 16th and 17th centuries individual sound changes spread from the Hollandic center of economic and political power to the more peripheral areas of the Netherlands [17]. Furthermore, modern Dutch dialects are known to be converging to the standard language [13], [18, pp. Kloeke also pointed out that sound changes may proceed on a word-by-word basis [17].

The case for lexical diffusion was championed by Wang and contrasts with the Neogrammarian view that sound changes are exceptionless and apply to all words of the appropriate form to undergo the change [19].

The Neogrammarian view is consistent with waves of sound changes emanating from Holland to the outer provinces, but it predicts that lexical properties such as a word's frequency of occurrence and its categorial status as a noun or verb should be irrelevant for predicting a region's pronunciation distance to the standard language. In order to clarify the extent to which variation at the lexical level co-determines the dialect landscape in the Netherlands, we combine generalized additive modeling, which allows us to model complex non-linear surfaces, with mixed-effects regression models, which allow us to explore word-specific variation.

First, however, we introduce the materials and methods of our study. The Dutch dialect data set contains phonetic transcriptions of words in locations in the Netherlands.

Figure 1 shows the distribution of the locations over the Netherlands together with the province names. The transcriptions in the GTRP were made by several transcribers between and , making it currently the largest contemporary Dutch dialect data set available.

The word categories include mainly verbs The complete list of words is presented in [13]. For the present study, we excluded three words of the original set (gaarne, geraken and ledig), as it turned out these words also varied lexically.

The standard Dutch pronunciation of all words was transcribed by one of the authors based on [21]. Because the set of words included common words e. Besides the information about the speakers recorded by the GTRP compilers, such as year of recording, gender and age of the speaker, we extracted additional demographic information about each of the places from Statistics Netherlands [23].

We obtained information about the average age, average income, number of inhabitants i. As Statistics Netherlands uses three measurement levels i.

For large cities e. Finally, for very small villages located in a district having multiple small villages, the neighborhood was selected which consisted of the single village e. For all locations, the pronunciation distance between standard Dutch and the dialectal pronunciations was calculated by using the Levenshtein distance [24].

The Levenshtein distance minimizes the number of insertions, deletions and substitutions needed to transform one pronunciation string into the other. The regular Levenshtein distance does not distinguish vowels and consonants and therefore may align a vowel with a consonant. To enforce linguistically sensible alignments, a syllabicity constraint is normally added such that vowels are not aligned with non-sonorant consonants. As shown in the example above, the Levenshtein distance increases by one for every mismatch.
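The basic computation can be illustrated with a minimal sketch. It is not the implementation used in the study: the `VOWELS` and `SONORANTS` sets are small hypothetical inventories chosen for illustration, each character is treated as one sound segment, and the syllabicity constraint is approximated by giving a forbidden vowel/consonant substitution an infinite cost.

```python
# Hypothetical, incomplete segment inventories (assumptions for illustration only).
VOWELS = set("aeiouyəɛɔɪøɑ")
SONORANTS = set("jwlrmnŋ")


def substitution_allowed(a, b):
    """Syllabicity constraint: a vowel may not align with a non-sonorant consonant."""
    a_vowel, b_vowel = a in VOWELS, b in VOWELS
    if a_vowel == b_vowel:
        return True  # vowel with vowel, or consonant with consonant
    consonant = b if a_vowel else a
    return consonant in SONORANTS


def levenshtein(source, target, syllabic=False):
    """Minimum number of insertions, deletions and substitutions (unit costs)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i  # delete all of source[:i]
    for j in range(1, cols):
        dist[0][j] = j  # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            if a == b:
                sub = 0
            elif syllabic and not substitution_allowed(a, b):
                sub = float("inf")  # forbidden alignment
            else:
                sub = 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
                dist[i - 1][j - 1] + sub,  # match or substitution
            )
    return dist[-1][-1]
```

Under the constraint, a vowel facing a non-sonorant consonant has to be handled by one deletion plus one insertion, so `levenshtein("a", "t")` costs one operation without the constraint and two with `syllabic=True`.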

Some sounds, however, are phonetically closer to each other than other sounds, e. A distance measure for two pronunciations should reflect this. Pairs of sounds which are aligned relatively frequently are assigned a low distance, while sounds which co-occur relatively infrequently are assigned a high distance. The method is based on calculating the Pointwise Mutual Information score (PMI; [26]) between every pair of sounds and was found to improve alignments compared to the Levenshtein distance with and without the syllabicity constraint.
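As a rough illustration of the idea, the sketch below computes PMI scores from a list of aligned sound pairs using the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The function names, the way probabilities are estimated from counts, and the conversion from PMI to a distance (subtracting from the highest score) are assumptions made for this example; the full procedure is described in [27].

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """PMI for every unordered pair of sounds found in a list of aligned pairs."""
    aligned_pairs = list(aligned_pairs)
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    segment_counts = Counter(seg for pair in aligned_pairs for seg in pair)
    n_pairs = sum(pair_counts.values())
    n_segments = sum(segment_counts.values())
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs                   # relative frequency of the aligned pair
        p_x = segment_counts[x] / n_segments     # relative frequency of each sound
        p_y = segment_counts[y] / n_segments
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def pmi_distances(scores):
    """Turn PMI scores into distances: frequently aligned pairs get low values.

    Subtracting from the highest score is an assumption of this sketch.
    """
    highest = max(scores.values())
    return {pair: highest - score for pair, score in scores.items()}
```

For instance, if [e] is aligned with [i] three times and with [o] only once in otherwise balanced data, `pmi_distances` assigns the pair (e, i) a smaller distance than (e, o).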

In addition, a recent study by Wieling, Margaretha and Nerbonne (submitted) found that the automatically determined PMI distances between vowels correspond well with acoustic vowel distances for several languages. A detailed description of the PMI method can be found in [27]. As an illustration of the PMI method, consider the alignment of [tei] and [twa], now using the PMI-based costs: In contrast to the previous example, the [a] can only be aligned with [e], as the cost of aligning [a] and [i] is higher and the cost of deleting [e] is higher than deleting [i].

In the following, the pronunciation distances are based on the PMI-based Levenshtein distance. Because longer words are likely to have a greater pronunciation distance than shorter words (more sounds may change), we normalize the PMI-based word pronunciation distances by dividing by the alignment length. Given a fine-grained measure capturing the distance between two pronunciations, a key question from a dialectometric perspective is how to model pronunciation distance as a function of the longitude and latitude of the pronunciation variants.
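The normalization step can be sketched as follows. This is an illustrative assumption rather than the authors' code: the hypothetical `normalized_distance` function accepts any substitution cost function (for example one derived from PMI scores), defaults to unit costs, and, when several alignments share the minimal cost, divides by the length of the longest one.

```python
def normalized_distance(source, target, sub_cost=None, indel_cost=1.0):
    """Edit distance divided by the number of columns in the alignment.

    Each cell stores (cost, -length); taking the minimum picks the cheapest
    alignment and, among ties, the longest one (a tie-breaking assumption).
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0  # unit costs by default
    rows, cols = len(source) + 1, len(target) + 1
    best = [[(0.0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = (i * indel_cost, -i)
    for j in range(1, cols):
        best[0][j] = (j * indel_cost, -j)
    for i in range(1, rows):
        for j in range(1, cols):
            steps = (
                (i - 1, j, indel_cost),                                   # deletion
                (i, j - 1, indel_cost),                                   # insertion
                (i - 1, j - 1, sub_cost(source[i - 1], target[j - 1])),   # substitution
            )
            candidates = []
            for prev_i, prev_j, step in steps:
                cost, neg_length = best[prev_i][prev_j]
                candidates.append((cost + step, neg_length - 1))
            best[i][j] = min(candidates)
    cost, neg_length = best[-1][-1]
    return 0.0 if neg_length == 0 else cost / -neg_length
```

With unit costs, `normalized_distance("kat", "kot")` is one substitution over an alignment of three columns, so a single changed sound weighs less in a long word than in a short one.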

For understanding how longitude and latitude predict pronunciation distance, the standard linear regression model is not flexible enough. It can model pronunciation distance only as a flat plane spanned by longitude and latitude (by means of two simple main effects) or as a hyperbolic plane (by means of a multiplicative interaction of longitude by latitude).

A hyperbolic plane, unfortunately, imposes a very limited functional form on the regression surface that for dialect data will often be totally inappropriate. We therefore turned to generalized additive models (GAM), an extension of multiple regression that provides flexible tools for modeling complex interactions describing wiggly surfaces.

For isometric predictors such as longitude and latitude, thin plate regression splines are an excellent choice.

Thin plate regression splines model a complex, wiggly surface as a weighted sum of geometrically simpler, analytically well defined, surfaces [28]. The details of the weights and smoothing basis functions are not of interest to the user; they are estimated by the GAM algorithms such that an optimal balance between undersmoothing and oversmoothing is obtained, using either generalized cross-validation or restricted maximum likelihood (see [29] for a detailed discussion).
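To make the "weighted sum of simpler surfaces" concrete, here is a small pure-Python sketch of a classical full-rank thin plate spline in two dimensions. It is not the low-rank thin plate regression spline of mgcv, and it does not choose the smoothing level automatically; the `smoothing` parameter, the function names and the plain Gaussian elimination solver are assumptions made for this illustration. The fitted surface is a plane plus one radial basis surface r^2 log(r) per data point.

```python
import math


def tps_kernel(r):
    """Thin plate spline basis function phi(r) = r^2 * log(r), with phi(0) = 0."""
    return 0.0 if r == 0.0 else r * r * math.log(r)


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (are the points collinear?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_tps(points, values, smoothing=0.0):
    """Return (weights, plane) of a thin plate spline through 2-D points.

    smoothing=0 interpolates exactly; larger values give a smoother surface.
    """
    n = len(points)
    size = n + 3
    a = [[0.0] * size for _ in range(size)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            a[i][j] = tps_kernel(math.hypot(xi - xj, yi - yj))
        a[i][i] += smoothing
        # Plane terms, plus the side conditions that keep the weights orthogonal to planes.
        a[i][n], a[i][n + 1], a[i][n + 2] = 1.0, xi, yi
        a[n][i], a[n + 1][i], a[n + 2][i] = 1.0, xi, yi
    coefficients = solve_linear_system(a, list(values) + [0.0, 0.0, 0.0])
    return coefficients[:n], coefficients[n:]


def predict_tps(points, weights, plane, x, y):
    """Evaluate the fitted surface: a plane plus a weighted sum of basis surfaces."""
    a0, a1, a2 = plane
    bumps = sum(
        w * tps_kernel(math.hypot(x - px, y - py))
        for w, (px, py) in zip(weights, points)
    )
    return a0 + a1 * x + a2 * y + bumps
```

In this sketch, data that lie exactly on a plane yield weights of (numerically) zero, so the flat plane of ordinary regression appears as a special case, while non-planar data are captured by the weighted basis surfaces.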

The significance of a thin plate regression spline is assessed with an F-test evaluating whether the estimated degrees of freedom invested in the spline yield an improved fit of the model to the data. Generalized additive models have been used successfully in modeling experimental data in psycholinguistics, see [30] for evoked response potentials, and see [31]–[33] for chronometric data.

They are also widely used in biology, see, for instance, [34] for spatially explicit modeling in ecology. For our data, we use a generalized additive model to provide us with a two-dimensional surface estimator of pronunciation distance based on the combination of longitude and latitude, using thin plate regression splines as implemented in the mgcv package for R [29]. Figure 2 presents the resulting regression surface using a contour plot. The solid contour lines represent distance isoglosses.

Darker shades of gray indicate smaller distances, lighter shades of gray represent greater distances from the standard language.

Figure 2. Contour plot of the regression surface of pronunciation distance as a function of longitude and latitude, obtained with a generalized additive model using a thin plate regression spline. The black contour lines represent distance isoglosses, darker shades of gray indicate smaller distances (closer to the standard language), lighter shades of gray represent greater distances. Note that the empty square indicates the location of the IJsselmeer, a large lake in the Netherlands.

The general geographic pattern fits well with Kloeke's hypothesis of a Hollandic expansion: as we move away from Holland, pronunciation distances increase [17]. Kloeke showed that even in the sixteenth and seventeenth centuries the economic and political supremacy of the provinces of North and South Holland led to the spread of Hollandic speech norms to the outer provinces.

We can clearly identify the separation from the standard spoken in the provinces of North and South Holland (central west) of the province of Friesland (in the north), the Low Saxon dialects spoken in Groningen and Drenthe (in the northeast), and the Franconian dialects of Zeeland (in the southwest) and Limburg (in the southeast). The local cohesion in Figure 2 makes sense, since nearby locations tend to speak dialectal varieties which are relatively similar [35].

A problem with this generalized additive model is that the random-effects structure of our data set is not taken into account. In mixed-effects regression modeling for introductions, see, e. Fixed-effect factors are factors with a small number of levels that exhaust all possible levels e. Random-effect factors, by contrast, have levels sampled from a much larger population of possible levels.

In our data, there are three random-effect factors that are likely to introduce systematic variation that is ignored in our generalized additive model. A first random-effect factor is location. Our observations are made at locations where speakers were interviewed. Since these locations are a sample of a much larger set of communities that might have been sampled, location is a random-effect factor.

Because we used the pronunciations of a single speaker at a given location, location is confounded with speaker. Hence, our random-effect factor location represents both location and speaker.

The data obtained from the locations were coded phonetically by 30 different transcribers. Since these transcribers are themselves a sample of a larger set of possible transcribers, transcriber is a second random-effect factor in our model. By including transcriber in our model, we can account for biases in how individuals positioned the data that they listened to with respect to the standard language.



A central problem in considering the subjects of sociolinguistics and dialectology has to do with the relationship between these two topics, which has often been somewhat difficult and controversial. Is, for example, dialectology part of sociolinguistics, or is it a separate discipline? (Trudgill)

We have just seen that the discussion of when a form of language can be considered the same language as another or a different one is complex because social factors, such as politics and history, become involved. Other social factors besides these two also get involved, including geography, identity, and individuality. We will now explore how some of these various factors relate to language forms. Earlier we described dialects as produced by systematic distinctions in language form and use between groups. The ways people form groups, however, are quite varied, and we can identify several different types of dialects according to the different ways people divide into groups.


The study of DIALECTS, that is, of variant features within a language, their history, differences of form and meaning, interrelationships, distribution, and, more broadly, their spoken as distinct from their literary forms. The discipline recognizes all variations within the bounds of any given language; it classifies and interprets them according to historical origins, principles of development, characteristic features, areal distribution, and social correlates. The scientific study of dialects dates from the midc, when philologists using data preserved in texts began to work out the historical or diachronic development of the Indo-European languages.

Perceptual dialectology is a sub-branch of folk linguistics first systematized by Dennis Preston in the s e. Preston , Through the technique of mental mapping, borrowed from cultural geography, perceptual dialectologists seek to discover the perceived distribution of speeches, populations, and prevailing ideologies. The results show that the respondents are aware of the major linguistic boundaries within Egypt, although they did not pay the same attention to all areas: Siwa, the area around Marsa Matruh, the Nile Delta region and Upper Egypt were identified quite clearly by a great number of students, while the Sinai Peninsula was taken into account to a lesser extent and the oases were largely ignored. Our aim is to answer two main questions: first, we want to establish whether, or to what extent, linguistic boundaries drawn by professional dialectologists correspond to those drawn by non-specialists; second, we are interested in the way speakers define the languages, varieties, or ways of speaking whose existence they are aware of.

A dialect, on the other hand, is a particular form of a language which is unique to a specific region or social group. What is the difference between a "language" and a "dialect"? Regional dialects do have some internal variation, but the differences within a regional dialect are supposedly smaller than differences between two regional dialects of the same rank. Dialect, on the other hand, points to a different way of perceiving the dominant language, and is not merely a difference in diction.


Dialect, a variety of a language that signals where a person comes from.

