Isoglosses based on sound changes differentiating the West Iranian languages, a group comprising Persian, Kurdish, Balochi, and other speech varieties, have long been of interest to linguists. The West Iranian languages are traditionally divided into Northwest (containing Kurdish, Balochi, etc.) and Southwest (containing Persian and closely related dialects) subgroups, the latter of which can be defined by a small number of phonological and morphological innovations that took place before the attestation of Old Persian, its oldest member. At the same time, a comparable (if not larger) number of Persian innovations took place after Old and Middle Persian, and similar innovations can be seen in other West Iranian languages, showing the effect of the complex areal networks that existed during the development of these languages. A number of reflexes can be identified as Southwest Iranian or Northwest Iranian on the basis of the languages in which they occur; however, language contact has complicated the picture significantly. In some cases, it is not clear what the “correct” outcome of a given Proto-Iranian sound should be; for instance, in the word for ‘spleen’ (Proto-Iranian *spr
), Kurdish shows what is thought to be a typically SWIr outcome (*r/r
Persian supurz shows a typically NWIr outcome (*r/r
ing individual words as loans in specific languages, but some of these diagnostics are better founded than others; in general, the picture is often so noisy, and these heuristics are so tightly intertwined, that all the facts cannot be qualitatively resolved within a traditional comparative-historical framework. I propose an alternative way of analyzing West Iranian data that integrates insights from the comparative method with probabilistic modeling. While previous research has tended to make hard decisions regarding a language’s regular reflexes of sound change, this study avoids this approach; instead, I employ a quantitative approach intended to let regular behavior fall out of the data. This paper investigates this variation in historical phonology across etymological reflexes and languages on a large scale. Specifically, I use a Bayesian probabilistic model, the Hierarchical Dirichlet Process (HDP), to reduce the dimensionality of the data seen within and across languages into a set of latent, unobserved components representing dialect membership which can be shared by multiple languages. The HDP is non-parametric, meaning that there is no upper bound on the number of latent features inferred. Both languages and phonological variants are associated with the presence of a latent feature. This allows us to identify potential networks of language contact across the dataset. This methodology sheds light on a number of unresolved issues in the literature on West Iranian dialectology. I find, unsurprisingly, that West Iranian languages show admixture (to differing degrees) from two major dialect components, roughly corresponding to Northwest and Southwest Iranian dialect groups. 
I provisionally resolve a small number of questions regarding the dialectal provenance of certain types of sound change; while the impact of this paper’s results is somewhat limited due to the relatively small size of the data used, the results are interpretable, and the methodology I use is promising. I discuss future directions for models of this sort.
This paper is structured as follows: Section 2 introduces the concept of West Iranian and the languages surveyed by this paper. Section 3 gives a condensed overview of the problems posed by West Iranian historical phonology for the traditional comparative method of historical linguistics. Section 4 provides an in-depth (but not exhaustive) catalogue of sound changes in West Iranian languages, pinpointing a few problematic examples where traditional methodologies have led researchers to potentially unjustified conclusions regarding language contact within Iranian. Sections 5–7 provide a conceptual overview of the HDP and related models, highlighting their relevance to the issues outlined in the previous section; an outline of the rationale for the model used in this paper; and a high-level description of the inference procedure employed. Results and discussion follow in Sections 8–9; technical details of the model employed can be found in the Appendix.
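The hierarchical idea behind the HDP can be illustrated with a small generative sketch. It is not the model or the inference procedure used in this paper: the truncation level, the concentration values, the function names, and the placeholder language labels are all assumptions made for illustration. Global component weights are drawn by truncated stick-breaking, and each language then draws its own mixture over the same shared components, which is what allows several languages to load on a common latent dialect component.

```python
import random


def stick_breaking(concentration, truncation, rng):
    """Draw truncated stick-breaking weights over shared latent components."""
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        # Break off a Beta(1, concentration) fraction of the remaining stick.
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # leftover mass is assigned to the last component
    return weights


def language_weights(global_weights, concentration, rng):
    """Draw one language's mixture ~ Dirichlet(concentration * global_weights)."""
    # A Dirichlet draw is a vector of normalized Gamma draws; the small floor
    # keeps the Gamma shape parameter strictly positive.
    draws = [rng.gammavariate(max(concentration * w, 1e-12), 1.0)
             for w in global_weights]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
# Hypothetical settings: 10 components at most, illustrative concentrations.
global_weights = stick_breaking(concentration=1.0, truncation=10, rng=rng)
per_language = {
    name: language_weights(global_weights, 5.0, rng)
    for name in ("language A", "language B")  # placeholder labels, not real data
}

for name, weights in per_language.items():
    print(name, [round(w, 3) for w in weights])
```

A smaller language-level concentration makes each language's mixture deviate more from the global weights; the truncation is only a computational convenience, since the HDP itself places no upper bound on the number of components.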
The Iranian languages are traditionally divided into East and West subgroups, but the genetic status of these labels is shaky. Historically, Bartholomae (1883:1) divided Old Iranian into western and eastern variants, the former represented by Old Persian, and the latter by Avestan. The Grundriss der iranischen Philologie, particularly Wilhelm Geiger’s contribution, provides a great deal of information on dialectology and subgrouping of contemporary Iranian languages. In this work, the chief distinction that cuts across Iranian is between “Persian” and “Non-Persian” dialects (Geiger 1901:414). East and West are used as purely geographic labels: at one point, Balochi, generally classified in later work as a West Iranian language, is referred to as East Iranian (p. 414). There is not full agreement regarding which Iranian languages are western and which are eastern; the languages Ormuri and Parachi are considered to be West Iranian languages by some scholars (Grierson 1918, Oranskij 1977, Efimov 1986), but the consensus following Morgenstierne (1929) places them in East Iranian. The problematic nature of the geographic labels was noted at an early date by Bailey (1933). Sims-Williams (1996:651) states that East Iranian is better understood as a Sprachbund than a genetic grouping, as there are very few non-trivial innovations shared by all languages in this group. Wendtland (2009) finds that there are no secure shared phonological or morphological characteristics between the East Iranian languages, and argues against Northeast and Southeast subgroups (a division provisionally suggested in Morgenstierne 1926 and followed in Oranskij 1977, Kieffer 1989 and elsewhere). Cathcart (2015) and Korn (2016, 2019) argue that there are virtually no non-trivial innovations shared among West Iranian languages that could serve as diagnostics for subgrouping.
Regardless of the genetic status of West Iranian, the label is meaningful, not only from a typological standpoint (West Iranian languages are highly convergent in their morphosyntax) but in terms of many of the diachronic trends displayed by West Iranian languages. Contact with non-Indo-European linguistic stocks, such as Turkic and Semitic, may have aided in shaping the linguistic profiles of West Iranian languages (Stilo 2005, 2018). Even if there are no good shared genetic innovations among West Iranian languages, the study of inter-dialectal West Iranian contact has the potential to shed light on the socio-historical development of Iran and surrounding regions.
The West Iranian languages are traditionally divided into Northwest and Southwest groups. The Southwest group, comprising Old, Middle and New Persian, as well as closely related dialects such as Bashkardi, Kumzari, Judeo-Tati, and others, is generally viewed as a genetic subgroup, defined by a small number of innovations. The Northwest group has fewer subgroup-defining innovations uniting it. The finer details of this distinction are not of particular importance to this paper, as the major goal here is for dialectal groupings to fall out of the behavior displayed by languages in this paper’s sample, which are listed in Table 1.
West Iranian languages show a great deal of deviation from expected outcomes of historical phonology. This is clear in the oldest language; Old Persian contains a number of words which display the reflexes s, z, sp, and zb for Proto-Iranian (PIr) instead of the expected outcomes θ, d, s, and z. This has led scholars to draw a distinction between “proper Old Persian” and “Median” forms (cf. Hoffmann 1976:60ff.), the latter label an allusion to the confederation which preceded the Achaemenid Empire (with which Old Persian is associated). Although we can reliably identify only a single form as explicitly Median (
‘dog’, recorded by Herodotus, which shows the sound change
number of Old Iranian onomastic items are generally assumed to be Median.
Words containing irregular historical phonological reflexes are common in Middle and New Persian as well, and are generally ascribed to contact with Northwest Iranian languages (although there are several probable loans from East Iranian as well; see Sims-Williams 1989:167). Northwest Iranian languages show the same degree of irregularity and contain a number of clear loans from various chronological stages of Persian, which is not surprising, given the sociopolitical influence of the Persian language in Iranian antiquity and onward.
It is likely that a number of mechanisms have worked together to create the complex patterns seen across West Iranian. These include (but are not limited to) language-internal factors such as the following:
• Poorly understood conditioning environments: we may not fully understand the factors influencing regular sound change within languages
• Analogical change, including paradigmatic leveling and extension, contamination, etc.
Additionally, inter-language factors like the following are almost certainly involved:
• Borrowing of lexical items
• Lexical diffusion of sound changes
As mentioned above, most explanations of irregularity appeal to lexical borrowing, often from an identifiable source such as Persian. However, it is additionally possible that more than one dialectal source of similar-looking reflexes was involved (e.g., the change may not have been restricted to Persian); furthermore, it is possible that under the umbrella of widespread multilingualism, speakers imposed sound changes from one dialect onto words from other speech varieties. In certain Mischformen it is quite clear that diffusion of sound changes was at work, rather than wholesale lexical borrowing. In many cases, however, it is not possible to distinguish between the two mechanisms. Additionally, it is not entirely clear whether similar-looking sound changes should be treated as unified, stemming from a single speech variant, or whether nearly identical sound changes were developed in parallel in different speech communities, possibly at different times. I identify some of these problems in the survey of sound change below, and propose a data-driven solution to at least some of these issues.
Table 1: West Iranian languages in the data set, along with alternative names and sub-variants (in italics), sources from which information was taken, and compatible glottocodes (Hammarström et al. 2017). Frequently used abbreviations for language names are provided. Pre-modern languages are indicated with an asterisk, Southwest Iranian languages with Transcriptions for pre-modern forms follow the sources cited.
Figure 1: Approximate locations of languages in sample
Below, I give a synopsis of historical phonological innovations in West Iranian languages (viewed through the lens of Persian, which has the best-documented historical record) and discuss outstanding problems. These developments are given in rough chronological order (where a chronology can be securely established), starting with innovations preceding Old Persian, and so on, focusing on some particularly vexing problems.
Dialectal differentiation is visible in the earliest attested West Iranian records, which consist of Achaemenid Old Persian inscriptions, as well as fragmentary Median records. At this stage, several phonological and morphological innovations that define Southwest Iranian as a subgroup can be identified (see below).
The locus classicus of West Iranian dialectal differentiation is Tedesco’s (1921) study of Middle Persian/Parthian isoglosses in the Manichean texts of Turfan. Lentz (1926) discusses dialectal variation found in the Šāh-nāma. In many cases this variation can be periodized with respect to when variant forms were introduced, especially in the case of Persian (cf. Paul 2005). The isoglosses identified in these works have served as the basis for a large number of dialectological investigations. Over the past century, the list of variables has been supplemented (Bailey 1933, Krahnke 1976, Stilo 1981), and scholars have debated which features in particular are the most meaningful for West Iranian dialectology (Paul 1998a, Korn 2003, Windfuhr 2009), in terms of joint versus independent innovations.
4.1 Changes to PIr *ć, *´
The changes Middle Persian, New Persian
are found in a stratum of Old Persian (OP) vocabulary, and thought to be the expected outcome in Southwest Iranian languages. However, OP also exhibits a number of doublets or irregular reflexes of the aforementioned Proto-Iranian sounds, usually ascribed (as mentioned above) to Median admixture, though we know little about the true nature of the Median language, given the paucity of records.
This variation is well attested in Old Persian: PIr in one layer of vocabulary, but s elsewhere; PIr
in (likely) the same stratum, but z elsewhere. This variation is described further below.
4.1.1 PIr *ć-
OP (or post-OP) initial θ- consistently corresponds to Middle Persian (MP; I distinguish between Phl and MMP only for forms exhibiting variation between the two dialects)
The fact that OP θ develops to MP h in most environments has led many scholars to assume that forms with MP s- are NW Iranian loans (cf. Gershevitch 1962a:2); however, Salemann (1901) takes initial MP s- to be the regular reflex of earlier θ-. The development of PIr to h- is shown in a single form, NP hadba ‘centipede’, found in the Burhān-i Qāṭiʿ, a 17th-century dictionary (Morgenstierne 1932:55).
4.1.2 PIr *ću̯
Reflexes of PIr are highly probative with respect to Iranian subgrouping; the “proper” Southwest Iranian outcome is taken to be s, while Khotanese Saka and Wakhi show
dish and Balochi, showing “transitional” behavior between Northwest and Southwest Iranian, appear to participate in the change
with Southwest Iranian, but not the changes
. In most Southwest Iranian dialects the change
must postdate the change
. Northwest Iranian languages (other than Kurdish and Balochi) and East Iranian languages (other than Khotanese Saka and Wakhi) show sp, or a sequence of sounds thought to descend from it, e.g., Ossetic fs; Khunsari, Gazi, Sangesari
Zoroastrian Dari sv is most likely secondary rather than an archaism preserving the glide
surfaces as sv as well, e.g.,
(Vahman and Asatrian 2002:21) The change
cannot be reconstructed to a hypothetical ancestor of the Central Iranian languages which share it (cf. Skjærvø 2009:50–51) without excluding Kurdish and Balochi from this group, but these languages cannot be placed in the Southwest Iranian group: in most Southwest Iranian dialects, the change
must postdate the change
probably had a phonetic value close to
only in highly marginal dialects, e.g.,
‘louse’ in Judeo-Shirazi (a dialect closely related to Persian but somewhat differentiated in terms of historical phonology), if from
(Morgenstierne 1960, Borjian 2020). It is likely that changes to
represent a sort of areal diffusion among (originally) non-peripheral Iranian languages, albeit an old one which has operated prior to early Median and Scythian onomastic items and is found in the archaic Avestan language as well. It is worth noting that similar fortition of OIA
has taken place in the peripheral Indo-Aryan language Khowar as well as some Nuristani languages, though Morgenstierne (1926, 1932) cautions against connecting these developments with the Iranian one.
Persian shows reflexes containing sp at all chronological stages. New Persian also shows the cluster sf. Henning argues that this cluster cannot be secondary from earlier *sp, and could instead be from a dialect in which PIr “resulted directly in sf” (Henning 1963:71, fn. 13). Schwartz (2006:223) argues against the influence of Arabic (which resulted in certain sporadic p > f changes, since Arabic lacks p) in certain words with sf. The circumstances under which NP sf came about remain unclear.
4.1.3 PIr *ćr
A small number of Old and Middle Persian words show OP ‘restore’
caus. (contaminated with
, according to Kent 1951:188), MP
‘conveying, dispatch’ (Cheung 2007:355). Kent (1942:80) claims that OP
nautiy ‘hear’ 3sg (
) yields NP
, but the latter form is better connected with
‘hear’ (Cheung 2007:456). Elsewhere, Middle and New Persian show s(V)r and on occasion
PIr ‘hear’ (caus.) > MP
(Cheung 2007:357) PIr
‘buttocks’
PIr ‘mother-in-law’ (possibly under influence from
in-law’, as suggested by a reviewer) PIr
‘teardrop’
4.1.4 PIr *ćn, *´n
Initial Proto-Iranian appears to surface as sn- in Iranian, though evidence is restricted to reflexes of one etymon in two languages (YAv
Cheung 2007:349); medial
‘will, favor’ (OAv
‘wish’ inst.sg. may show the effects of analogy rather than a regular outcome; see Hoffmann and Forssman 2004:102 as well as Schwartz 2010 for an alternative view).
PIr appears to have become OP
) word-initially (e.g., PIr
‘recognize’) and
word-medially (e.g.,
‘festival’,
dance’). From what we can tell, Northwest Iranian languages appear to have medial -zn- (on metathesis to -nz- in Median onomastic items, see Gershevitch 1962a), e.g., Parthian gazn ‘treasure’ vs. NP
‘abundance’
; Persian shows some forms with -zn-, possibly loans from Northwest or East Iranian, e.g., NP gavazn, cf. Sogdian
, Khotanese Saka
(Gershevitch 1954:57). At the same time, Northwest Iranian languages appear to agree with Persian in reflecting
for initial
etc. (Cheung 2007:466).
4.1.5 PIr *´u“
The sequence is found in only a small number of Proto-Iranian etyma. Old Persian contains reflexes of only two of these forms, patiyazbayam ‘proclaim’ 1sg. impf. (
‘tongue’ (
). The latter form is believed by
4.1.6 PIr *ći̯
The development of PIr in Persian is not entirely clear. Its fate is intertwined with that of the cluster
), whose regular Old Persian reflex is thought to be
1951:32); cf.
Old Persian attests the cluster only word-medially, where it surfaces as θiy (showing characteristic OP anaptyxis between consonants and glides), e.g.,
‘house’ loc.sg., possibly via paradigmatic leveling of the stem
). Old Persian does not directly attest this cluster word-initially; Middle Persian shows varying reflexes:
PIr ‘a fabulous bird’
(MacKenzie 1971:74). Young Avestan
‘eagle’ may show a dissimilation
the presence of the off-glide of the diphthong ai, which may also account for Persian s, but this development was clearly not pan-Iranian, since the initial consonant of Balochi
‘falcon, hawk’ (Korn 2005:129) cannot continue PIr
Word-internally, Middle and New Persian reflect variation between earlier
invokes a rhythmic law proposed by Klingenschmitt (ibid.) to account for phonological irregularities in Middle Persian nouns. Armed with these ideas, we can account for some of the variation within Persian, if we assume the pre-forms sus
(the stress placement assumed here follows Back 1978:30ff.), but this still does not explain h in reflexes of
, which should have undergone the same development as
, as noted by Gershevitch.
In Persian, as elsewhere in West Iranian, language contact, analogy, and prosodically conditioned change have interacted to bring about the complex variation seen in reflexes of PIr
and related sounds. The limited knowledge we have of late Old Persian prosody can help to tease out the role of the last mechanism, but only to a certain extent. It is tempting to account for variation in West Iranian languages with no documented history in a similar manner, but this is purely speculative. An instructive example is the following thought experiment: Korn (2005:284) contends while discussing Balochi
tortoise’ that “a genuine Bal. word should show
.” However, given that Balochi
reflects PIr
(cf. Korn 2005:105), a pre-Balochi form
is not inconceivable on historical phonological grounds, but perhaps overly speculative since we know virtually nothing about the phonotactics, syllabification, and stress pattern of Balochi’s precursor. However, we also do not know whether
is the Balochi reflex of
across the board (as assumed by Korn), or only in specific environments. Ultimately, we may benefit from relaxing some of these assumptions and employing a probabilistic model that allows us to make generalizations regarding languages’ diachronic behavior on the basis of intra-language and inter-language distributions of sound changes.
4.2 PIr *θ
PIr *θ changes into OP θ (> MP, NP h) in most conditioning environments, though it may develop into MP s- word-initially, e.g., PIr *θaxta- (cf. Khwarezmian The change
is well established, as is
, though numerous exceptions to these developments exist as well.
4.2.1 PIr *θn
There are relatively few Proto-Iranian sources of the cluster *θn, but these are realized as across the board in West Iranian, to the exclusion of the possible Median proper name in Akkadian
‘looking for a wife’ (Tavernier 2007:273).
PIr *-i-θna- > MP abstract noun suffix ; cf. Zazaki infinitive suffix
(Benveniste 1935:105)
Middle Persian ‘elbow’, a doublet with
‘cubit’, is most likely a loan from a source closely related to Sogdian (cf.
wife’ is to be connected with
(Tafazzoli 1974:119; Monchi-Zadeh 1990:134) it perhaps shows a secondary change
seen in some other words.
4.3 PIr *št, *žd
Old, Middle and New Persian (along with other Iranian languages) show variation between ; later language attests variation between
‘reward’, NP
). There is disagreement as to whether OP
is due to analogy (Kent 1951:34) or a sound change defining Southwest Iranian (Skjærvø 1989), and what the relationship of this behavior is to similar-looking developments in the later language. Lipp (2009:196ff.) states that OP -st- (found as a reflex of PIE
) is due to analogy, while other developments are due to a phonological change predating Middle Persian:
1. PIE *h
2. PIr
3. PIr (superlative suffix) > MP -ist; e.g., Phl
‘sweetest’
(cf. Iron Ossetic xorz, Digor Ossetic
4.4 *r + coronal change
4.4.1 Change to l
A number of West Iranian forms show a sound change whereby *r + coronal sequences become l. This behavior is common in Middle and New Persian, perhaps representing a regular sound change which operated between Old and Middle Persian:
In some cases, this development has operated across an intervening vowel, likely unstressed:
(MMP ‘leader’ (cf. NP
, perhaps a later compound) PIr
‘pack-saddle’ (cf. Sogdian
‘saddle’, cf. Sims-Williams
However, this development is not exceptionless: it does not operate in forms like NP padarzah ‘a wrapper in which clothes are folded up’, if from (Cheung 2007:63, marked as a loanword perhaps due to
), which appears to have undergone a dissimilatory development
that is not paralleled in
. It is unlikely that language-internal factors (viz., different conditioning environments) can account for the entire range of variation seen within Persian.
Persian. However, there are a large number of exceptions to this rule within Middle and New Persian; for example, NP buland forms a doublet with burz, thought to represent a Northwest Iranian form (Beekes 1997:3). For some etyma, Persian lacks l, while a non-Persian reflex displays it, e.g., NP supurz ‘spleen’ versus Kd The uncertainty surrounding this behavior can be summed up by the following comment by MacKenzie (1961:78) on the outcome of PIr
in Kurdish: “I do not think it is possible to be certain which is the true Kurdish development, but whether we consider the many words with
as native or loan-words their preponderance is significant.” Gurani contains the forms
, suffixation unclear), which cannot be Persian loans; in the first, the change to OP d predates lateralization of *r/r
triggered by the following *-r , which subsequently underwent lateralization (e.g.,
, see §4.6).
in these forms owes itself to Persian influence as opposed to some other source is unclear.
4.4.2 *rn > rr
The change *rn > rr is attested in Middle Persian and onward, as seen in the following examples:
It has been suggested that the changes *rn > rr and *rn > l (see above) are interconnected, and that variation in reflexes of *rn represents dialectal variation within West Iranian (Schwartz 1971:292, fn. 14).
Middle and New West Iranian languages as a whole show an overwhelming tendency toward the change *rn > r(r). West Iranian words for ‘lamb’, if reflexes of PIE
; Mayrhofer 1992:225–6), show this behavior across the board:
MP ; S Bashkardi
However, Balochi and Parthian forms show the change *rn > n(n); Zazaki shows rn only via analogical maintenance or restoration, but otherwise (Korn 2005:133–4).
4.5 r ∼ l variation
Proto-Indo-European *l surfaces as r in the vast majority of Iranian languages. PIE *l > *r is given as a Proto-Iranian sound change in most handbooks, yet there are a number of exceptions to this development (Schwartz 2008), indicating that PIE *l has been conserved in some peripheral dialects. Northwestern dialects also contain morphological variants with l that have been lost by Persian but have congeners in Indic, e.g., Kashani , Mazandarani engel (cf. Old Indic
) against NP
) ‘finger’ (Horn 1893; Krahnke 1976:226–8).
However, some cases of West Iranian l may be secondary rather than archaic (Hübschmann 1895:262ff.). It is not clear, for example, where forms like S. Tati (Ebrahim-abadi)
(Sagz-abadi)
‘elm’ (Yar-shater 1969:71)
Similarly, one finds S. Tati
(Yar-shater 1969:71), Vidari
(Baghbidi 2005:36)
‘fig’ (forms elsewhere in Iranian point to *r, e.g., Sogdian
; Gharib 1995:37). For ‘worm’, the evidence clearly points to an Indo-Iranian etymon *kr(i)mi- containing r, and any instances of l in Iranian languages should be secondary (e.g., Ossetic
shows expected *r > l change in anticipation of
innovations are also found in Kurdish valg, Judeo-Tati velg (Miller 1892), etc. = NP barg <
(Horn 1893:47); this variant surfaces in the Dari dialect of New Persian as balg (Korn 2005:160). Non-archaic l can also be found in NP
‘hunt’ vs. Bandari, Bakhtiyari
Bashkardi
‘mountain sheep’, if from a verbal root *skar- with no good Indo-European cognates (Cheung 2007:346). Ultimately,
variation across West Iranian is due not only to preservation of original PIE *l, but also a secondary change to l from original *r, especially evident in loans originally from non-Iranian languages, e.g., Judeo-Isfahani
NP karafs (Stilo 2007)
Arabic. We can be sure of the directionality in cases where there is secure evidence from outside of Indo-Iranian, but in the absence of such information, it can be difficult to tease apart primary and secondary l; it is equally unclear whether all variant pronunciations stem from the same dialectal source.
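The kind of distributional evidence at stake here can be made concrete with a small tabulation. The sketch below is illustrative only: it uses the reflexes of the word for ‘leaf’ cited in this paper, codes each form by whether it shows l or r (a coding assumed for this example, not an analysis of primary versus secondary l), and computes smoothed per-language proportions. The function name and the pseudocount value are likewise assumptions.

```python
from collections import Counter, defaultdict

# (language, gloss, form, outcome). The forms are those cited in the text;
# the "l"/"r" outcome coding is an assumption made for this illustration.
OBSERVATIONS = [
    ("Kurdish", "leaf", "valg", "l"),
    ("Judeo-Tati", "leaf", "velg", "l"),
    ("New Persian", "leaf", "barg", "r"),
    ("New Persian (Dari)", "leaf", "balg", "l"),
    ("South Tati", "leaf", "varga", "r"),
]


def outcome_proportions(observations, outcomes=("l", "r"), pseudocount=0.5):
    """Return smoothed per-language proportions of each outcome."""
    counts = defaultdict(Counter)
    for language, _gloss, _form, outcome in observations:
        counts[language][outcome] += 1
    table = {}
    for language, counter in counts.items():
        # Additive smoothing keeps unattested outcomes from getting zero mass.
        total = sum(counter.values()) + pseudocount * len(outcomes)
        table[language] = {
            o: (counter[o] + pseudocount) / total for o in outcomes
        }
    return table


for language, proportions in outcome_proportions(OBSERVATIONS).items():
    print(language, proportions)
```

With one token per language the proportions say very little, which is the point: a single doublet cannot establish the regular outcome for a language, and generalizations have to draw on many etyma and on patterns shared across languages.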
4.6 Changes to PIr *u̯-
Reflexes of PIr are characterized by a high degree of irregularity across West Iranian.
Developments within Persian serve to demonstrate the complexity of these developments. Proto-Iranian
surfaces as Middle Persian
, but is otherwise unchanged in Middle Persian (with a few stray exceptions; see below):
PIr ; Bakhtiyari gosne; Balochi (Marw)
; Larestani
; Mazandarani
‘qui a faim’ (showing the Central Iranian development
, cf. Asatrian 2012); Taleshi
Generally speaking, PIr
However, some exceptions exist:
In the following forms, PIr
Change to g- does not operate in the following words beginning with PIr but not all, have a grave (i.e., labial, labiodental, or velar) consonant later in the word:
Elsewhere, PIr
As is apparent, none of the sound laws sketched above is exceptionless. It is almost certain that contact between closely related dialects is responsible for some of the doublets seen above. But it is also clear that succinct generalizations regarding the behavior of PIr in different conditioning environments are hard to come by. This issue has not received a systematic treatment in the literature. Lentz (1926:280–1) seems to consider
regular Southwest Iranian outcome. MacKenzie (1971:76) takes the change
feature shared by Persian and Northern and Central Kurdish dialects, whereas “[i]n most other W.Ir dialects w- is little modified in this position, while in Bal. it has developed into
Attempts to establish the regular behavior of PIr for non-Persian West Iranian languages have proved as difficult as for Persian. Early Judeo-Persian records, thought to typify a link between Middle and Modern Persian, present an equally challenging picture (Paul 2013:35ff.). An errant strain of Middle Persian shows g- for expected b-, e.g., Pazand guzurg : NP buzurg (Bailey 1933:56). A large number of West Iranian languages leave
more or less unmodified (surfacing as v, w or f but more importantly not merging with PIr *g-, *b-), but forms with g- and b- still preponderate. For instance, while Zazaki usually shows
‘wind’), the word for ‘blood’ is
(Paul 1998b). South Tati varga ‘leaf’ sits alongside
‘spring’ (Yar-shater 1969:95, 103, 110). The Kurmanji dialect of Kurdish shows a preference for b- where other languages do not, e.g.,
‘boar’ : NP
‘hungry’ : NP gurusnah (Soane 1913, Thackston n.d., Chyet 2003), but elsewhere agrees with Persian, e.g.,
If a regular outcome can be established for a given non-Persian language, there is a tendency to assume that any words containing deviations from it are loans from Persian (though this approach is in general avoided by Korn 2005). For instance, Marw Balochi burz ‘mace’ (; note the metathesis identical to Persian) does not show expected g(w)-, hence, Elfenbein (1963:25) marks it as a “Persic” loan. However, there is no reason to expect NP b- in a reflex of a Middle Persian word with an initial syllable of the shape
unless a grave consonant is found later in the word (and if the sound law sketched above is accurate). The Northern Kurdish dialect Kurmanji does, as mentioned above; this behavior can be found sporadically in other non-Persian languages as well (e.g., Mamasani
pig’, Mann 1909:184). Given this evidence, these languages may be more viable donors for Balochi burz than Persian (the metathesis found in both of the forms is another question entirely).
4.7 Metathesis
Over the course of Persian history, more than one development of metathesis has taken place (Hübschmann 1895:266–7), involving the re-sequencing of word-final and some word-internal clusters ending in r (and on occasion l). By the advent of Middle Persian, we see narm ‘soft’ < *namra- and warz ‘club, mace’ . Fricative + r/l clusters (as well as some fricative + fricative clusters) have undergone metathesis after Middle Persian attestations:
Other West Iranian languages vary as to whether they show metathesis in the same words; this variation is often language internal:
PIr (in compounds); Gur varwa; Khun varf; Lar vafr, barf; Maz varf; Siv varf; Tal var; S Tati vara; Zaz vewr; Judeo-Tati
‘snow’ can be found in the materials of Miller (1892:59), but Authier (2012:323) gives verf. PIr *taxra- > Bal (Rakhshani) ta(h)l
Language contact must have played a role in bringing about intense variation, but the exact mechanisms are unclear. Metathesis is generally associated with Persian, since it can be documented in Persian’s history. However, it is not clear whether the presence of metathesis in a non-Persian language is due to wholesale lexical borrowing or lexical diffusion (i.e., the adoption of the pronunciation rC for earlier Cr). Lexical borrowing from Persian tends to be assumed in the literature. For languages with varf : NP barf, it is assumed that the loan is from Middle Persian, or some period predating the change of MP w- to NP b-; for instance, Eilers (1978:749) derives Gazi ‘snow’ from MP varf [sic]. However, this is unlikely to be the case. If we take Judeo-Persian to be representative of the link between Middle and New Persian (cf. MacKenzie 2003), then Judeo-Persian forms like
(Paul 2013:50) make it clear that metathesis postdates the merger of MP w- with b-, and that an intermediate stage *warf was unlikely. Additionally, w-, v-, etc. cannot be secondary from earlier *b- in the forms given above, since most of the languages mentioned show b- for original PIr
This detail aside, there are other reasons to question the account of lexical borrowing from Persian: first, this metathesis may not be a solely Persian development. Since most West Iranian languages (with exceptions, e.g., Yarshater 1962) lost final syllable nuclei, it is likely that many languages had words ending in -xr, -fr, etc., clusters which posed articulatory and perceptual problems, and were resolved in a variety of ways, including metathesis. Second, many of the above forms can be analyzed only as Mischformen, vitiating a lexical borrowing account. Instead, it is possible that speakers in a situation of heavy multilingualism imposed pronunciations from forms in one language upon their cognates in another, a well-documented phenomenon in situations of multidialectalism, generally affecting less frequently uttered words (Phillips 1984, Stollenwerk 1986, Wieling et al. 2011).
4.8 Changes affecting *dr
Gershevitch (1962b:78–9) discusses reflexes of the word for ‘spade’, demonstrating that some modern West Iranian languages reflect a form *barda- (metathesized from *badra-, which is internally derived from *badar-). The source of metathesis in *barda- is unclear. Schwartz (1971:297–8) shows that Iranian languages continue a doublet in the word for ‘grape’, ), the latter being secondary and a likely East Iranian loan into Persian and other languages. It is not clear whether the metathesis in *barda- is a related phenomenon.
4.9 Prothetic x-, h-
Two separate protheses have operated during the history of Persian. The first involves sporadic insertion of x- before an initial vowel, and predates Middle Persian; the second involves sporadic insertion of h- before an initial vowel, and predates New Persian.
These developments can be seen elsewhere in West Iranian, e.g., ‘duck’ (language unmarked by Asatrian 2012:113)
; Kumzari, Bandari, Larestani
(cf. Bakhtiyari hars, Zazaki hesri). Korn (2005:155–9) provides a detailed treatment of this issue, and makes a strong case that some items showing initial h- in both Balochi and Kurdish are due to contact, though elsewhere, the sporadic presence of h- may be a sort of hypercorrection, as in many English dialects (Wells 1982:252–6), and not necessarily due to wholesale lexical borrowing (further bolstered by the fact that many Iranian languages lose initial h- under varying circumstances, e.g.,
4.10 č ∼ š
Some quasi-systematic variation between is found in forms across West Iranian. In some cases, original
due to the interference of Arabic, which lacks a phoneme
(in the relevant dialects), as in
In other forms, as noted by Horn (1901:71), is secondary, e.g., Zor Yazdi
per’
‘evening’ (1st member
, Bartholomae 1904:553); Kashani
‘unhulled rice’
‘herdsman’, Kurdish
‘butcher’)
(Horn 1901). Martin Schwartz (p.c.) points out that reflexes of the latter etymon may have undergone influence from NP
‘staff, crook’.
4.11 *t > r
The change *t > r in North Tati dialects was noted by Henning (1954:173). This change is seen in other languages, e.g., Judeo-Yazdi ), Judeo-Isfahani
(
, North Bashkardi
. Some Central Dialects show variation between
for ‘milk’, though this may be due to the continuation of separate etyma
4.12 Other developments
Above, a number of developments thought to be of interest to West Iranian dialectology were discussed. In this study, it is not possible to consider all possible meaningful changes, including vowel fronting (Krahnke 1976), variation (e.g., S Tati fercel ‘dirty’ : Bakhtiyari
), and other isoglosses. It is hoped that, as digitization efforts grow, fully data-driven approaches will allow a wider range of innovations to be taken into account (see §9 for details).
4.13 Key Issues
The foregoing sections served to illustrate the difficulties posed for the traditional comparative method by West Iranian sound change. Along the way, some problematic analytical decisions made by scholars have been highlighted, which are restated here:
• Elfenbein (1963) assumes that Marw Balochi burz ‘mace’ is a Persian loan, given unexpected b-, but it could easily be from another language (§4.6)
• Eilers (1978) assumes that Gazi is a loan from Middle Persian *warf, but no such form existed, given the relative chronology between the developments
*-fr > rf; if the metathesis shown by the Gazi form is due to Persian influence, lexical diffusion rather than lexical borrowing was likely involved (§4.7)
• Korn (2005) assumes that PIr in all conditioning environments, and hence that Balochi
‘turtle, tortoise’, is a loan, but we cannot be sure this is the case (§4.1.6)
It is hoped that the qualitative points made or revived here are convincing on their own merits: namely, that some of the segmental and prosodic contextual factors involved in West Iranian sound laws are indeterminate, that not all donor languages are necessarily Persian, and that pure lexical borrowing is not the sole mechanism of contact. Still, it remains difficult to resolve many of the questions raised above within the constraints of the traditional comparative method. In general, it is difficult to maintain a bird’s-eye view of the many innovations and archaisms that cut across the West Iranian lexicon; while discussing one type of variation, another type is ignored (the above discussion is no exception). The remainder of this paper develops a probabilistic methodology designed to relieve historical linguists of the need to make hard decisions regarding phonological outcomes in a dialectal group, and instead let regularities fall out of the data.
As described above, West Iranian languages show admixture from an unknown number of latent (i.e., unobserved or unknown) dialectal components, each with its own individual sound laws and analogical changes. The key aim of this work is to learn which underlying components have contributed various features to the noisy pattern observed. A number of statistical techniques exist for the purpose of reducing the dimensionality of multivariate categorical data; mixed-membership models of this sort learn clusters that capture co-occurrence patterns of features in a data set in a way that the human eye cannot easily manage to do. These include certain classes of so-called generative models, which attempt to tell a story specifying one or more latent parameters which are thought to have generated the observed data. The latent parameters specified in a generative model can be estimated, usually within a Bayesian framework, which infers their posterior distributions. Bayesian modeling allows prior distributions to be imposed over these parameters, which serves as a sometimes-necessary means of ensuring that the model embodies realistic behavior.
I draw upon probabilistic models of document classification in order to motivate the model I use in this paper. Topic modeling, which seeks to identify the topics present in a set of documents by associating the words found in them with one or more topics, is a well-known application for Bayesian mixed-membership models. Latent Dirichlet Allocation (LDA) is one such model (Blei et al. 2003). LDA assumes a fixed number of topics; it further assumes that there is an overall distribution over possible topics, that each document has a specific distribution over topics, and that each word in each document is distributed according to a particular topic. The posterior global distribution over topics, document-specific topic distributions, and word-specific topic associations can then be inferred. It should be noted that if the procedure is entirely unsupervised, topics will receive meaningless labels such as “Topic 1” rather than “History,” and that these labels require further interpretation. LDA is highly similar to the Structure algorithm of population genetics (Pritchard et al. 2000), which has been used in some linguistic applications (Reesink et al. 2009, Bowern 2012, Longobardi et al. 2013, Syrjänen et al. 2016). Figure 2 provides a hypothetical representation of how inferred topic assignments might appear when LDA is applied to document classification as well as to the data investigated in this paper.
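The generative story that LDA tells can be made concrete with a short simulation. The following is a minimal sketch in Python, not the implementation used in this paper; the function names, the toy vocabulary, and the concentration values `alpha` and `beta` are assumptions chosen purely for illustration.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw one probability vector from a Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_documents(n_docs, doc_len, n_topics, vocab, alpha=0.5, beta=0.5, seed=0):
    """Generate (word, topic) pairs following the LDA generative story."""
    rng = random.Random(seed)
    # Each topic has its own distribution over the vocabulary.
    topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    documents = []
    for _ in range(n_docs):
        # Each document has its own distribution over topics.
        doc_topic = sample_dirichlet([alpha] * n_topics, rng)
        tokens = []
        for _ in range(doc_len):
            z = sample_index(doc_topic, rng)      # latent topic of this token
            w = sample_index(topic_word[z], rng)  # observed word, given the topic
            tokens.append((vocab[w], z))
        documents.append(tokens)
    return documents


# Hypothetical toy vocabulary; the topic labels 0 and 1 carry no meaning of their own.
docs = generate_documents(n_docs=3, doc_len=8, n_topics=2,
                          vocab=["king", "battle", "vowel", "stress"])
```

In inference the direction is reversed: only the words are observed, and the topic assignments and distributions must be recovered from them.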
Figure 2: A schematic comparison of topic modeling and the approach used in this paper. In types of topic modeling such as Latent Dirichlet Allocation (LDA), it is assumed that each content word instance in each document in a corpus is generated by a given “topic,” as represented by the shaded circles. These assignments are unknown a priori and must be inferred from the data. The example on the left provides hypothetical posterior topic assignments for a sentence fragment, with individual topics generating word instances from similar semantic fields or spheres of reference. The example on the right provides a hypothetical assignment of dialect components to sound changes operating in words found in a single Iranian speech variety.
LDA requires practitioners to provide a specification a priori of the total number of topics assumed. It is often unreasonable to assume that an exhaustive list of possible topics has been drawn up. LDA has a non-parametric extension, the Hierarchical Dirichlet Process (HDP, Teh et al. 2005, 2006), which allows for a potentially infinite number of topics. Over the course of the inference procedure, the model will return the number of topics which best explain the data.
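One common way of seeing how a model can leave the number of topics open is the stick-breaking construction of the global component weights, truncated at a generous upper bound. The sketch below illustrates that construction only; it is not necessarily the formulation used in the Appendix, and the function name, the truncation level, and the value of `gamma` are assumptions.

```python
import random


def stick_breaking_weights(gamma, truncation, seed=0):
    """Truncated stick-breaking: break off Beta(1, gamma) fractions of a unit stick."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, gamma)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    # The last component absorbs whatever is left so the weights sum to one.
    weights.append(remaining)
    return weights


weights = stick_breaking_weights(gamma=1.0, truncation=10)
# Components with non-negligible weight; the 0.01 threshold is arbitrary.
effective = sum(1 for w in weights if w > 0.01)
```

Smaller values of `gamma` tend to concentrate the mass on the first few components, while larger values spread it over more of them, so the number of components that matter in practice is governed by the data and the prior rather than fixed in advance.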
I wish to extend the HDP model to the problem of admixture in the vocabularies of Iranian languages. By aggregating the patterns of variation in reflexes of a number of Proto-Iranian etyma, we may be able to identify components in the lexicon of each language which conceivably can be explained via historical language contact. I assume that there exists a set of areal components which underlie the variation reflected synchronically in West Iranian languages, and that we can recover their associations with variants and representation within languages.
An advantage of Bayesian models of this sort over classical methods for categorical data analysis is that they are generally robust to uneven or missing data; this is critical, given the patchy coverage for some Iranian languages. At the same time, mixed-membership models can potentially be sensitive to skews in data coverage. If a large number of features bearing on a particular isogloss are well attested in the data, but others are not, the algorithm used to infer component distributions may learn a distribution based on the former, even when the latter are highly relevant (but under-attested). For this reason, I have taken pains to cast a wide net in the selection of features whilst maintaining parity in terms of the number of data points pertaining to each feature.
For the upcoming analysis, words exhibiting the relevant Proto-Iranian sounds and sound sequences were collected from grammars and dictionaries by searching for the relevant semantic field, yielding a dataset of 1229 words. It is acknowledged that this means of data collection is highly limited, as some languages are better etymologized than others, and it would be preferable to take a top-down approach to data collection using a digitized etymological dictionary or etymological database, when such resources are developed. As mentioned above, the goal here is to tease apart effects of areal contact and conditioning environments within West Iranian. As a concrete example, the presence of b in Sorani Kurdish in Sorani Kurdish
) is due either to contact (e.g., the language has taken the words over from different donor languages) or different conditioning environments in the two words triggering the changes
Information regarding conditioning environments is key to the feature representation which serves as model input. However, explicitly stipulating conditioning environments requires too many assumptions. I use the etymon itself as a proxy for conditioning environments; stating that
in the etymon
is akin to stating that the change is triggered by the following *-r
ysis for a traditional historical grammar; however, any redundancy that this representation entails will be picked up by the model as part of the dimensionality reduction that it carries out. A potential concern is that morphological variants of the same etymon are reflected in the catalogue of features; as mentioned above, different languages may continue different variants of a historical doublet ‘flower’. A similar concern is that of homophony between reconstructed etyma, namely formally identical items that cannot be straightforwardly unified semantically (e.g.,
‘pond, reservoir’ and
). I leave the first problem untreated, with the hope that if a number of morphological variants of a single etymon are reflected in the data, this variation will be detectable in the model’s output, namely via uncertainty in component level sound change distributions concerning this etymon.
I address the second problem by merging formally identical but semantically disparate reconstructions with one another, rather than treating them as instantiating different conditioning environments. For the purposes of the model, each unobserved dialect component has a collection of sound change parameters associated with it. I envision this to be a categorical probability distribution over the possible observed outcomes for each PIr sound of interest in each etymon (our proxy for the conditioning environment). These parameters can be visualized as shown in Table 2, for a given dialect component (probabilities are hypothetical).
Table 2: Hypothetical sound change probabilities for a latent dialect component. Note that probabilities of outcomes for the relevant PIr sound(s) sum to one, and that distributions are sparse (with the majority of mass concentrated on one outcome).
Under the Neogrammarian hypothesis, sound change is exceptionless (Osthoff and Brugmann 1879, Bloomfield 1933, Hoenigswald 1965, Davies 1978). The probability of a sound change operating in a given speech variety is strictly categorical: one outcome will occur with 100% probability, all others with 0% probability. This paper’s model relaxes the Neogrammarian hypothesis, allowing sound change probabilities to be non-categorical, for two reasons. The first is practical: rigid categorical-valued variables which assign zero, rather than infinitesimal, probability mass to an outcome will cause problems for the inference procedure, and enumerating all possible combinations of categorical feature states is computationally infeasible. The second pertains to the real world: the model must account for irregularity within a component that cannot be explained (due to analogy, so-called “sporadic” change, or some other mechanism). However, it is still ideal to constrain these probability distributions such that they are sparse, with the majority of mass concentrated on one outcome, rather than smooth (i.e., with mass distributed quasi-uniformly across outcomes). Ultimately, while we cannot constrain the model to enforce regular sound change, we can employ priors that regularize sound change, encouraging probabilities to be very close to either 0 or 1.
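The effect of a sparsity-inducing prior can be illustrated by comparing draws from symmetric Dirichlet distributions with small and large concentration parameters. This is a minimal sketch under assumed values (four possible outcomes, concentrations of 0.05 and 5.0); it does not reproduce the priors or settings used in this paper.

```python
import random


def sample_dirichlet(concentration, n_outcomes, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(n_outcomes)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_max_probability(concentration, n_outcomes=4, n_draws=500, seed=0):
    """Average probability of the single most likely outcome across many draws."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(concentration, n_outcomes, rng))
               for _ in range(n_draws)) / n_draws


sparse = mean_max_probability(0.05)  # small concentration: near-categorical draws
smooth = mean_max_probability(5.0)   # large concentration: quasi-uniform draws
```

With a small concentration, almost all of the mass in a typical draw falls on one outcome, which approximates regular sound change while still leaving the other outcomes a nonzero probability.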
For the purposes of this study, I make no attempt to model intermediate stages in sound change. For instance, it is not entirely clear whether the f- in Sivandi comes from an intermediate
, or directly from
(though the latter scenario is more likely, as such changes are better attested in Sivandi). Techniques have been proposed for reconstructing forms at intermediate nodes on fixed phylogenies (Bouchard-Côté et al. 2007, 2013), but not for situations like ours, where a form in a given language is generated by one of an unknown number of dialect components, rather than a single fixed ancestor.
The relatively abstract model of feature representation employed at least partly ensures that the sound changes dealt with by the model are meaningful. This paper’s data set comprises 1160 sound change instances instantiating 190 unique sound change types in 32 West Iranian languages.
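The feature representation described in this section, in which the etymon serves as a proxy for the conditioning environment, can be sketched as a simple encoding step. The records below are hypothetical placeholders rather than items from the data set, and the field names are assumptions made for illustration.

```python
from collections import Counter, namedtuple

# One record per sound change instance; all values below are hypothetical placeholders.
Instance = namedtuple("Instance", ["language", "etymon", "sound", "outcome"])

instances = [
    Instance("Language A", "etymon_1", "*w", "b"),
    Instance("Language A", "etymon_2", "*w", "w"),
    Instance("Language B", "etymon_1", "*w", "w"),
    Instance("Language B", "etymon_2", "*w", "w"),
]


def change_type(instance):
    """A sound change type pairs a sound in a specific etymon with its outcome."""
    return (instance.etymon, instance.sound, instance.outcome)


# Integer ids for sound change types and for languages.
type_index = {t: i for i, t in enumerate(sorted({change_type(x) for x in instances}))}
language_index = {l: i for i, l in enumerate(sorted({x.language for x in instances}))}

# Model input: one (language id, change type id) pair per instance.
encoded = [(language_index[x.language], type_index[change_type(x)]) for x in instances]
type_counts = Counter(change_type(x) for x in instances)
```

Because the same sound in two different etyma yields two distinct change types, any redundancy between them is left for the model to discover rather than stipulated in advance.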
The generative process underlying the HDP and the technical details of inference can be found in the appendix. A non-technical description of the HDP follows. Each data point (i.e., the reflex of a Proto-Iranian sound in a particular etymon in a given language, e.g., PIr ) is associated with a latent dialect component. The probability that a data point is associated with a given latent dialect component is dependent on a language-level probability distribution over dialect components
, as well as a component-level distribution over sound changes
. We do not know the values of these parameters, and must infer parameter values of high posterior probability (i.e., of high likelihood as well as high prior probability) from the data. Additionally, we do not know the true number of dialect components; this unknown must be learned by the model as well.
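Given point estimates of the two parameters, the dependence just described can be sketched as a small calculation: the probability that a token belongs to a component is proportional to the weight of that component in the token's language multiplied by the probability of the token's sound change under that component. The function and argument names are assumptions, the numbers are hypothetical, and the sketch shows the standard mixture-model computation rather than this paper's inference code.

```python
def component_posterior(language_weights, change_probs):
    """Posterior over components for one token, given point estimates.

    language_weights[k]: weight of component k in the token's language.
    change_probs[k]: probability of the token's sound change type under component k.
    """
    unnormalized = [w * p for w, p in zip(language_weights, change_probs)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical values for a language drawing on two components.
posterior = component_posterior([0.6, 0.4], [0.9, 0.1])
```

In this toy case the first component is favored both by the language and by the sound change, so it receives most of the posterior mass.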
The HDP involves three hyperparameters: is the concentration parameter of the symmetric Dirichlet prior over each dialect component’s sound change distribution; the parameter
controls the dispersion of data points across dialect components within a given language;
controls the number of components inferred (at the risk of oversimplifying). These hyperparameters can be fixed, or (as in the case of the parameters described in the previous paragraph) given a fully Bayesian treatment by estimating them from the data.
Parameter and hyperparameter values can be estimated in several ways, including Markov chain Monte Carlo (MCMC) approaches such as Gibbs Sampling (Geman and Geman 1984) or Variational Bayesian methods (Bishop 2006). In the former procedure, values for each parameter are sampled stochastically on the basis of current values of all other parameters; after many iterations, and under standard regularity conditions, the samples drawn by the Gibbs sampler converge in distribution to the posterior distribution of each parameter. Variational methods can be either deterministic or stochastic, and unlike MCMC methods, they assume a parametric form of the posterior distribution of each variable known as the variational posterior distribution, the parameters of which are iteratively updated. I use Automatic Differentiation Variational Inference (ADVI, Kucukelbir et al. 2017), as implemented in PyMC3 (Salvatier et al. 2016) to infer the posterior distributions of (as described in the Appendix).
As stated in the previous section, the inference procedure finds posterior probability distributions for two key parameters: , which gives each language’s posterior distribution over dialect components;
, which gives each dialect component’s distribution over sound changes.
8.1 Language-level component distributions
Figure 3: Language-level posterior distributions over latent dialect components
As is clear from Figure 3, most languages in the sample show a relatively uniform profile in terms of their component makeup, favoring a small number of identical components. This pattern dovetails with received wisdom regarding the widespread dominance of Persian over other West Iranian languages in the period following the Safavid empire roughly 500 years before the present day (Borjian 2009); this homogenization appears to have resulted in a more or less uniform profile for New West Iranian languages in terms of the sound changes reflected in their vocabularies (albeit with some degree of differentiation).
Virtually all languages in the sample show some degree of admixture from component k = 1, along with differing degrees of components . Interestingly, k = 1 appears to be strongly associated with developments that are thought to be typical of NW Iranian, such as the retention of initial
and the non-operation of the change
. It is not surprising that Modern Persian attests this component to a strong degree, given the well-known NW Iranian component in its vocabulary; at the same time, this proportion is higher than expected, as certain instances of SW Iranian behavior are strongly associated with this component (e.g.,
receiving higher posterior probability in components
While visualizing gives us an overview of the predominant components present in a language’s vocabulary, the picture presented is difficult to interpret in that it does not allow us to pinpoint exactly why these components are present, in terms of the reflexes to which they are linked. To gain a closer understanding of this issue, it is instructive to inspect the posterior probabilities of component membership for individual sound change instances. I describe these issues in detail in the upcoming section.
8.2 Posterior distributions over components for sound change instances
I use the MAP values of to reconstruct the posterior probability distribution over component membership for each individual token with index i — i.e., each sound change instance in each language — in the data set,
. These probability distributions are given in the Appendix, as well as a table summarizing these values by averaging them across instances for each sound change type. These values allow us to address hypotheses about the provenance of certain sound changes (such as those discussed in §4.13). Many of these distributions exhibit high uncertainty or entropy, with probability mass spread out across more than one dialect group rather than concentrated on a single group; this is perhaps a consequence of the relatively small size of the data set used in this study. At first glance, this uncertainty may seem to make the results difficult to interpret, but on the contrary these results are quite interpretable in that this uncertainty is relatively informative. Consider the following posterior distributions, concerning reflexes of Proto-Iranian *br
‘high’ and
which show the posterior probability of a sound change type given a dialect component (I exclude components where none of the probabilities exceed .05 for visual clarity).
Tokens exhibiting the change (our shorthand for forms such as burz, which do not undergo change to l) are associated strongly with a single latent dialect component, k = 1, as are tokens exhibiting the change
. Tokens exhibiting the changes
do not show a particularly strong affinity with any latent dialect component. What is critical here is that changes of the former type, usually associated with Northwest Iranian languages, show behavior that patterns much differently from changes usually associated with Southwest Iranian. This allows us to potentially classify individual change types according to whether the posterior distributions they exhibit are more in line with prototypical Northwest Iranian or Southwest Iranian sound changes.
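The contrast drawn above between distributions concentrated on one component and distributions spread across several can be quantified with Shannon entropy. The following is a minimal sketch; the two example distributions are hypothetical and are not taken from Table 3.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_entropy(probs):
    """Entropy divided by its maximum, log(K), giving a value between 0 and 1.

    Assumes at least two components (K > 1).
    """
    return entropy(probs) / math.log(len(probs))


concentrated = normalized_entropy([0.97, 0.01, 0.01, 0.01])  # hypothetical values
diffuse = normalized_entropy([0.30, 0.25, 0.25, 0.20])       # hypothetical values
```

A value near 0 indicates a sound change type tied strongly to one component, while a value near 1 indicates a type with no clear affinity.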
On the basis of these distributions, I propose provisional solutions for the problems identified in §4.13. We find that Elfenbein’s (1963) identification of Marw Balochi burz as a Southwest Iranian loan is indeed highly probable. Table 3 shows the component distributions of changes affecting PIr ‘mace, club’. We see that change to b- shows a distribution similar to those of the prototypically Southwest Iranian sound changes discussed above, while change to
shows Northwest Iranian behavior. Similarly, we find that change to
‘turtle/tortoise’ patterns with canonically Northwest Iranian changes; hence, there is no strong reason to consider Balochi
a loan, as assumed by Korn (2005), since it patterns with many other typically Balochi features. Finally, changes concerning the etymon
‘snow’ suggest a Northwest Iranian origin for the presence of w- and a Southwest Iranian origin for metathesis in the form; hence, Gazi
is probably a genuine Mischform, pace Eilers (1978), stemming perhaps from a scenario where speakers in contact with a neighboring dialect exhibiting metathesis imposed this sound change on their inherited reflex of this etymon.
Table 3: Posterior component distributions for selected sound changes (I exclude components with probability mass under .05 for visual clarity)
The results from the model are by no means the final word on these issues, and it is to be stressed that the conclusions drawn above are only tentative. It is likely that in many cases of idiosyncratic or unusual behavior, the paucity of data employed is the culprit. I have demonstrated, however, that this sort of methodology serves as a promising technique for teasing apart questions concerning dialectal admixture in Iranian and other dialect groups. I am confident that this method will produce increasingly realistic and reliable results as digital resources for Iranian languages grow, facilitating big data approaches to questions such as those addressed in this paper.
In this paper, I outlined a series of unresolved problems in Iranian dialectology and developed a probabilistic methodology designed to address these problems. In doing this, for the most part, I sought proof of concept as to whether Bayesian applications to Iranian dialectology might yield results which shed light on outstanding problems in the field as well as those that jibe with received wisdom. To some extent, this exercise was a success: I have shown that this model has great potential for resolving questions of the sort asked in this paper, but will benefit from further refinement. Below, I identify future directions that will improve this line of research:
9.1 Data
This paper made use of a relatively small data set compiled by hand from existing grammars. Sound changes were manually coded according to the behavior they displayed. Additionally, only sound changes thought to be of interest to West Iranian dialectology were included in the feature catalog. While I do not feel that this method of feature selection introduced any sort of pernicious bias that negatively affected results (after all, this paper focused on patterns displayed by sound changes thought to be probative for the purposes of Iranian dialect grouping across the vocabularies of West Iranian languages), it may be desirable to employ a more hands-off approach to feature selection and extraction, which will necessitate larger digitized etymological data sets. Additionally, this paper excluded East Iranian languages (including the languages Ormuri and Parachi), and shared patterns across both East and West Iranian should not be neglected; again, fulfilling this desideratum requires bigger data. At least two tacks can be taken for the purpose of data expansion. The first would involve digitizing existing etymological dictionaries (Cheung 2007, Rastorgueva and Ėdel’man 2003) and converting them into a computationally tractable data format; however, no complete Iranian etymological dictionary currently exists for all parts of the lexicon, though current efforts such as the Atlas of the Languages of Iran (Anonby et al. 2019), in its pilot phase at the time of writing, work towards filling this gap. The second approach involves applying semi-supervised cognate detection methods (List 2012, Rama 2016) to digitized Iranian word lists, which can potentially be coupled with semi-supervised methodologies for linguistic reconstruction (Meloni et al. 2019). While these methods still face many challenges, they can potentially save specialists a great deal of time and work in compiling large etymological resources.
Whatever the approach employed, I believe that methods of the sort introduced in this paper will greatly benefit from the use of a larger data set. It is possible that the use of different data may yield different results from those reported in this paper.
9.2 Models
While this paper employed the HDP, several alternative types of nonparametric mixed-membership model exist. The HDP has certain properties that are undesirable for certain uses, possibly including the dialectological application explored in this paper: specifically, the proportion of a component across all data points is correlated with its proportion within languages. It may be the case that a certain component is very rare overall, but well represented within one or a small number of languages. Certain alternatives to the HDP deal explicitly with this issue (Williamson et al. 2010).
9.3 Representation of sound change
In designing this paper’s methodology, I made the radical decision to make no prior assumptions about the nature of the conditioning environments involved in the sound changes under study, instead treating entire etyma as conditioning environments. At first blush, this may seem like an implementation of the dictum that every word has its own history, attributed to dialect geographers such as Jules Gilliéron and Hugo Schuchardt. This is not the case: by linking the diachronic behavior of Proto-Iranian sounds in individual etyma to a finite number of dialect components exhibiting regularized sound change, we have inferred information regarding patterns of sound change within components as well as patterns of admixture within languages; the model ultimately embodies the interpretation of the above problem posed by dialect geographers that was provided by Bloomfield (1933:360).
At the same time, it may be wrong to ignore the effect of phonetic similarity between conditioning environments on sound change. It may be the case that in a particular dialect component, undergoes a particular type of change in similar-looking etyma like
and
, but a different change in a more dissimilar etymon such as
. I have ignored this possibility; my goal was to let this systematicity fall out of the data in a bottom-up fashion. If desired, it is possible to employ a prior over sound change that can express covariance, such as the logistic normal distribution, which will encourage Proto-Iranian sounds to behave similarly in phonetically similar environments (which can potentially be operationalized via a smooth kernel function of the edit distance between the etyma containing these environments).
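The operationalization suggested above can be sketched as follows: compute the edit distance between each pair of etyma and pass it through a smooth kernel, yielding a similarity matrix that could inform the covariance of a logistic normal prior. This is only an illustration of the idea, which was not implemented in this paper; the strings are placeholders rather than real reconstructions, and the kernel form and `length_scale` value are assumptions. A matrix built this way is not guaranteed to be positive semi-definite for arbitrary sets of strings and may need adjusting before being used as a covariance matrix.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_matrix(etyma, length_scale=2.0):
    """Squared-exponential function of edit distance between every pair of etyma."""
    return [[math.exp(-edit_distance(x, y) ** 2 / (2.0 * length_scale ** 2))
             for y in etyma] for x in etyma]


# Placeholder strings standing in for reconstructed etyma (not real reconstructions).
etyma = ["abcde", "abcdf", "xyz"]
K = similarity_matrix(etyma)
```

Here the first two placeholder strings differ in a single segment and receive a high similarity, whereas the third shares nothing with them and receives a similarity close to zero.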
This paper introduced a new way of looking at Iranian dialectal relationships. The focus was on sound change in West Iranian, but this method can potentially be extended to linguistic groups of similar geographic spread and time depth. My chief goal was to provide a means for relaxing assumptions regarding the operation of individual sound changes in individual languages, and allow regular patterns to fall out of the data. Much work remains to be done in order to understand the complex history of the Iranian languages. Larger data resources are needed, and cooperation among linguists is needed in order to design and refine the probabilistic models we use; as data analysts, we need to work together to characterize the stochastic processes that we believe to have generated the data we observe, formalized in probabilistic terms. There needs to be a willingness to simplify models (if particular models are intractable), and an effort to keep models flexible, so that they can be expanded. It is likely that many of these goals are well within reach.
All errors and infelicities are my own responsibility. Many of the issues treated in this paper are inspired by discussions with Martin Schwartz. I am additionally grateful for comments and suggestions provided by Tim Aufderheide, Florian Wandl, two anonymous referees, and editor James Clackson, as well as audiences at the Universities of Zurich and Tübingen.
Anonby, E. and A. Asadi (2014). Bakhtiari studies: phonology, text, lexicon. Uppsala: Acta Universitatis Upsaliensis.
Anonby, E., M. Taheri-Ardali, and A. Hayes (2019). The Atlas of the Languages of Iran (ALI): A research overview. Iranian Studies 52, 199–230.
Asatrian, G. (2002). The Lord of Cattle in Gilan. Iran and the Caucasus 6, 75–85.
Asatrian, G. (2012). Marginal remarks on the history of some Persian words. Iran and the Caucasus 16, 105–116.
Authier, G. (2012). Grammaire juhuri, ou judéo-tat, langue iranienne des Juifs du Caucase de l’est. Beiträge zur Iranistik. Dr. Ludwig Reichert Verlag.
Azami, C. A. and G. Windfuhr (1972). A Dictionary of Sangesari with a Grammatical Outline. Tehran: Franklin Book Programs.
Back, M. (1978). Die sassanidischen Staatsinschriften. Leiden: Brill.
Baghbidi, H. R. (1383 [2005]). Guyeš-i vidari. (19), 18–26.
Bailey, H. W. (1933). Western Iranian dialects. Transactions of the Philological Society, 46–64.
Bailey, H. W. (1973). Mleccha, Balōč, and Gadrōsia. Bulletin of the School of Oriental and African Studies 36(3), 584–587.
Barker, M. A. (1969). A Course in Baluchi. Montreal: Institute of Islamic Studies, McGill University.
Bartholomae, C. (1883). Handbuch der altiranischen Dialekte (Kurzgefasste vergleichende Grammatik, Lesestücke und Glossar). Leipzig: Breitkopf & Härtel.
Bartholomae, C. (1904). . Strassburg [Strasbourg]: Karl J. Trübner.
Beekes, R. S. P. (1997). Historical phonology of Iranian. Journal of Indo-European Studies 25(1-2), 1–26.
Benedictsen, Å. M. and A. Christensen (1921). Volume 6 of Historisk-filosofiske Meddelelser. Copenhagen: Det Kgl. Danske Videnskabernes Selskab.
Benveniste, E. (1935). Les infinitifs avestiques. Paris: Adrien Maisonneuve.
Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin & New York: Springer.
Blau, J. (1980). Manuel de Kurde (dialecte Sorani). Paris: Klincksieck.
Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
Bloomfield, L. (1933). Language. New York: Holt, Rinehart and Winston.
Borjian, H. (2009). Median succumbs to Persian after three millennia of coexistence: Language shift in the Central Iranian Plateau. Journal of Persianate Studies 2, 62–87.
Borjian, H. (2020). The Perside language of Shiraz Jewry: A historical-comparative phonology. Iranian Studies 53(3-4), 403–415.
Bouchard-Côté, A., D. Hall, T. L. Griffiths, and D. Klein (2013). Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110, 4224–4229.
Bouchard-Côté, A., P. Liang, T. Griffiths, and D. Klein (2007). A probabilistic approach to diachronic phonology. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, pp. 887–896. Association for Computational Linguistics.
Bowern, C. (2012). The riddle of Tasmanian languages. Proceedings of the Royal Society B: Biological Sciences 279(1747), 4590–4595.
Brust, M. (2018). Historische Laut- und Formenlehre des Altpersischen: mit einem etymologischen Glossar. Innsbruck: Institut für Sprachwissenschaft der Universität Innsbruck.
Cantera, A. (2009). On the history of the Middle Persian nominal inflection. In W. Sundermann, A. Hintze, and F. de Blois (Eds.), Exegisti Monumenta: Festschrift in Honour of Nicholas Sims-Williams, Volume 17 of Iranica, pp. 17–30. Wiesbaden: Harrassowitz.
Cathcart, C. (2015). Iranian dialectology and dialectometry. Ph. D. thesis, University of California, Berkeley.
Cheung, J. (2007). Etymological dictionary of the Iranian verb. Leiden: Brill.
Chyet, M. L. (2003). with selected etymologies by Martin Schwartz. New Haven/London: Yale University Press.
Davies, A. M. (1978). Analogy, segmentation and the early Neogrammarians. Transactions of the Philological Society 76, 36–60.
Durkin-Meisterernst, D. (2004). Dictionary of Manichaean Texts III. Turnhout: Brepols.
Durkin-Meisterernst, D. (2014). Grammatik des Westmitteliranischen (Parthisch und Mittelpersisch). Wien: Verlag der Österreichischen Akademie der Wissenschaften.
Efimov, V. A. (1986). Jazyk ormuri: v sinxronnom i istoričeskom osveščenii. Moscow: Nauka.
Eilers, W. (1976). Westiranische Mundarten aus der Sammlung Wilhelm Eilers. Vol. 1: Die Mundart von Chunsar. Wiesbaden: Steiner. With assistance from Ulrich Schapka.
Eilers, W. (1978). Westiranische Mundarten aus der Sammlung Wilhelm Eilers. Vol. 2: Die Mundart von Gäz. Wiesbaden: Steiner. With assistance from Ulrich Schapka.
Elfenbein, J. (1963). A vocabulary of Marw Baluchi. Naples: Istituto Universitario Orientale di Napoli.
Emmerick, R. E. (1992). Iranian. In J. Gvozdanović (Ed.), Indo-European Numerals, pp. 289–346. Berlin & New York: Mouton de Gruyter.
Geiger, W. (1901). Kleinere Dialekte und Dialektgruppen. In W. Geiger and E. Kuhn (Eds.), Grundriss der iranischen Philologie, Volume 1, Chapter 8, pp. 287–423. Strassburg [Strasbourg]: Karl J. Trübner.
Geman, S. and D. Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.
Gershevitch, I. (1952). Ancient survivals in Ossetic. Bulletin of the School of Oriental and African Studies 14(3), 483–495.
Gershevitch, I. (1954). A Grammar of Manichaean Sogdian. Oxford: Blackwell.
Gershevitch, I. (1962a). Dialect variation in early Persian. Transactions of the Philological Society 63, 1–29.
Gershevitch, I. (1962b). Outdoor terms in Iranian. In W. B. Henning and E. Yarshater (Eds.), , pp. 76–84. London: Percy Lund, Humphries & Co.
Gharib, B. (1995). Sogdian Dictionary: Sogdian-Persian-English. Tehran: Farhangan Publications.
Grierson, G. A. (1918). The Ōrmuṛī or Bargistā language, an account of a little-known Eranian dialect. Memoirs of the Asiatic Society of Bengal 7, 1–101.
Hadank, K. (1930). Kurdisch-persische Forschungen, Abt. 3 (Nordwestiranisch) Bd. 2, Mundarten der Gûrân besonders das Kändûläî, Auramânî und Bâdschälânî. Berlin: Walter de Gruyter.
Hammarström, H., R. Forkel, and M. Haspelmath (2017). Glottolog 3.3. Max Planck Institute for the Science of Human History.
Henning, W. B. (1954). The ancient language of Azerbaijan. Transactions of the Philological Society, 157–177.
Henning, W. B. (1963). The Kurdish elm. Asia Major 10(1), 68–72.
Hoenigswald, H. M. (1965). Language change and linguistic reconstruction. Chicago: University of Chicago Press.
Hoffmann, K. (1976). Zur altpersischen Schrift. In J. Narten (Ed.), zur Indoiranistik, Volume 2, pp. 620–645. Wiesbaden: Ludwig Reichert Verlag.
Hoffmann, K. and B. Forssman (2004). Avestische Laut- und Flexionslehre (2nd ed.), Volume 84 of . Innsbruck: Institut für Sprachwissenschaft der Universität Innsbruck.
Horn, P. (1893). Grundriss der neupersischen Etymologie. Strassburg [Strasbourg]: Karl J. Trübner.
Horn, P. (1901). Neupersische Schriftsprache. In W. Geiger and E. Kuhn (Eds.), Grundriss der iranischen Philologie, Volume 1, pp. 1–200. Strassburg [Strasbourg]: Karl J. Trübner.
Hübschmann, H. (1895). Persische Studien. Strassburg [Strasbourg]: Karl J. Trübner.
Humbach, H. and P. Ichaporia (1998). Zamyād Yasht: Yasht 19 of the Younger Avesta: text, translation, commentary. Wiesbaden: Harrassowitz.
Ishwaran, H. and L. F. James (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96(453), 161–173.
Ivanow, W. (1940). The Gabri dialect spoken by the Zoroastrians of Persia, Volume 16 of Rivista degli Studi Orientali. Rome: Scuola Orientale nella R. Università di Roma.
Kamioka, K. and M. Yamada (1979). , Volume 1. Tokyo: Institute for the Study of Cultures of Asia and Africa.
Kent, R. (1942). Vocalic r in Old Persian before n. Language 18(2), 79–82.
Kent, R. (1951). Old Persian, Volume 33 of American Oriental Series. New Haven: American Oriental Society.
Kieffer, C. (1989). Le parāčī, l’ōrmuṛī. In R. Schmitt (Ed.), Compendium Linguarum Iranicarum, pp. 445–455. Wiesbaden: Ludwig Reichert Verlag.
Kingma, D. P. and J. Ba (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Klingenschmitt, G. (2000). Mittelpersisch. In B. Forssman and R. Plath (Eds.), Indoarisch, Iranisch und die Indogermanistik: Arbeitstagung der Indogermanischen Gesellschaft vom 2. bis 5. Oktober 1997 in Erlangen, pp. 191–229. Wiesbaden: Ludwig Reichert Verlag.
Korn, A. (2003). Balochi and the concept of North-Western Iranian. In C. Jahani and A. Korn (Eds.), The Baloch and their neighbours: ethnic and linguistic contacts in Balochistan in historical and modern times, pp. 49–60. Wiesbaden: Dr. Ludwig Reichert Verlag.
Korn, A. (2005). Towards a Historical Grammar of Balochi. Wiesbaden: Ludwig Reichert Verlag.
Korn, A. (2016). A partial tree of Central Iranian. Indogermanische Forschungen 121, 401–434.
Korn, A. (2019). Isoglosses and subdivisions of Iranian. Journal of Historical Linguistics 9(2), 239–281.
Krahnke, K. (1976). Linguistic Relationships in Central Iran. Ph. D. thesis, University of Michigan.
Kucukelbir, A., D. Tran, R. Ranganath, A. Gelman, and D. M. Blei (2017). Automatic differentiation variational inference. The Journal of Machine Learning Research 18(1), 430–474.
Kümmel, M. (2007). Konsonantenwandel. Wiesbaden: Dr. Ludwig Reichert Verlag.
Lecoq, P. (1979). Le dialecte de Sivand, Volume 10 of . Wiesbaden: Dr. Ludwig Reichert Verlag.
Lentz, W. (1926). Die nordiranischen Elemente in der neupersischen Literatursprache bei Firdosi.
Lipp, R. (2009). Die indogermanischen und einzelsprachlichen Palatale im Indoiranischen. Heidelberg: Carl Winter. 2 vols.
List, J.-M. (2012). Lexstat: Automatic detection of cognates in multilingual wordlists. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pp. 117–125. Association for Computational Linguistics.
Longobardi, G., C. Guardiano, G. Silvestri, A. Boattini, and A. Ceolin (2013). Toward a syntactic phylogeny of modern Indo-European languages. Journal of Historical Linguistics 3(1), 122–152.
Lubotsky, A. (2001). The Indo-Iranian substratum. In C. Carpelan, A. Parpola, and P. Koskikallio (Eds.), Early Contacts between Uralic and Indo-European: Linguistic and Archaeological Considerations. Papers presented at an international symposium held at the Tvärminne Research Station of the University of Helsinki 8-10 January 1999, Helsinki, pp. 301–317.
Lubotsky, A. (2002). Scythian elements in Old Iranian. In N. Sims-Williams (Ed.), Indo-Iranian Languages and Peoples, Volume 116 of Proceedings of the British Academy, Oxford, pp. 189–202. Oxford University Press.
MacKenzie, D. N. (1961). The origins of Kurdish. Transactions of the Philological Society, 68–86.
MacKenzie, D. N. (1971). A Concise Pahlavi Dictionary. London: Oxford University Press.
MacKenzie, D. N. (2003). The missing link. In L. Paul (Ed.), Persian Origins: Early Judaeo-Persian and the Emergence of New Persian, pp. 103–110. Wiesbaden: Harrassowitz.
Mann, O. (1909). Kurdisch-Persische Forschungen, Abt. 1: Die Tâjîk-Mundarten der Provinz Fârs. Berlin: Reimer.
Mann, O. and K. Hadank (1906-1932). Kurdisch-persische Forschungen. Berlin: Walter de Gruyter.
Mayrhofer, M. (1992). , Volume 1. Heidelberg: Winter.
Meloni, C., S. Ravfogel, and Y. Goldberg (2019). Ab antiquo: Proto-language reconstruction with RNNs. arXiv preprint arXiv:1908.02477.
Miller, V. F. (1892). . St. Petersburg: [Tip. Imp. Akademij Nauk].
Monchi-Zadeh, D. (1990). , Volume 15 of Acta Iranica. Leiden: Brill.
Morgenstierne, G. (1926). Report on a linguistic mission to Afghanistan. Oslo: H. Aschehoug & Co.
Morgenstierne, G. (1929). Parachi and Ormuri, Volume 1 of Indo-Iranian Frontier Languages. Oslo: Instituttet for Sammenlignende Kulturforskning, H. Aschehoug & Co. (W. Nygaard).
Morgenstierne, G. (1932). Persian etymologies. Norsk Tidsskrift for Sprogvidenskap 5, 54–56.
Morgenstierne, G. (1960). Stray notes on Persian dialects ii. Norsk Tidsskrift for Sprogvidenskab 19, 121–129.
Nawata, T. (1984). Mazandarani, Volume 17 of Asian and African Grammatical Manual. Tokyo: Institute for the Study of Languages and Cultures of Asia and Africa, Tokyo University of Foreign Studies.
Oranskij, I. M. (1963 [1977]). Les Langues Iraniennes. Translated by Joyce Blau. Paris: Klincksieck.
Osthoff, H. and K. Brugmann (1879). Morphologische Untersuchungen auf dem Gebiet der indogermanischen Sprachen, Volume 2. Leipzig: Hirzel.
Paul, D. (2011). A comparative dialectal description of Iranian Taleshi. Ph. D. thesis, University of Manchester.
Paul, L. (1998a). The position of Zazaki among West Iranian languages. In N. Sims-Williams (Ed.), Proceedings of the Third European Conference of Iranian Studies held in Cambridge, 11th to 15th September 1995. Part I: Old and Middle Iranian Studies, Wiesbaden, pp. 163–177. European Conference of Iranian Studies: Dr. Ludwig Reichert Verlag.
Paul, L. (1998b). Zazaki: Grammatik und Versuch einer Dialektologie, Volume 18 of Beiträge zur Iranistik. Wiesbaden: Dr. Ludwig Reichert Verlag.
Paul, L. (2005). The language of the in historical and dialectal perspective. In D. Weber (Ed.), Languages of Iran: Past and Present: Iranian Studies in Memoriam David Neil MacKenzie, pp. 141–151. Wiesbaden: Dr. Ludwig Reichert Verlag.
Paul, L. (2013). A Grammar of Early Judaeo-Persian. Wiesbaden: Ludwig Reichert Verlag.
Peeters, P. (1910). S. Eleutherios-Guhištazad. Analecta Bollandiana 29, 151–156.
Pelevin, M. (2010). Materials on the Bandari dialect. Iran and the Caucasus 14, 57–78.
Phillips, B. S. (1984). Word frequency and the actuation of sound change. Language 60, 320–342.
Pisowicz, A. (1985). Origins of the new and middle Persian phonological systems. Kraków: Uniwersytet Jagielloński.
Pritchard, J. K., M. Stephens, and P. Donnelly (2000). Inference of population structure using multilocus genotype data. Genetics 155(2), 945–959.
Rama, T. (2016, December). Siamese convolutional networks for cognate identification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 1018–1027. The COLING 2016 Organizing Committee.
Rastorgueva, V. S. and D. I. Ėdel’man (2000-2003). Moscow: Vostočnaja Literatura.
Rastorgueva, V. S., A. A. Kerimova, A. K. Mamedzade, L. A. Pireiko, and J. I. Edelman (2012). The Gilaki Language. Uppsala: Uppsala Universitet.
Reesink, G., R. Singer, and M. Dunn (2009). Explaining the linguistic diversity of Sahul using population models. PLoS Biology 7, e1000241.
Rzymski, C., T. Tresoldi, S. J. Greenhill, M.-S. Wu, N. E. Schweikhard, M. Koptjevskaja-Tamm, V. Gast, T. A. Bodt, A. Hantgan, G. A. Kaiping, S. Chang, Y. Lai, N. Morozova, H. Arjava, N. Hübler, E. Koile, S. Pepper, M. Proos, B. Van Epps, I. Blanco, C. Hundt, S. Monakhov, K. Pianykh, S. Ramesh, R. D. Gray, R. Forkel, and J.-M. List (2020). The database of cross-linguistic colexifications, reproducible analysis of cross-linguistic polysemies. Scientific Data 7(1), 13.
Salemann, C. (1901). Mittelpersisch. In W. Geiger and E. Kuhn (Eds.), Grundriss der iranischen Philologie, Volume 1, Chapter 3, pp. 249–332. Strassburg [Strasbourg]: Karl J. Trübner.
Salvatier, J., T. V. Wiecki, and C. Fonnesbeck (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2, e55.
Schapka, U. (1972). Die persischen Vogelnamen. Ph. D. thesis, University of Würzburg.
Schmitt, R. (1989). Altpersisch. In R. Schmitt (Ed.), Compendium Linguarum Iranicarum, pp. 56–85. Wiesbaden: Ludwig Reichert Verlag.
Schmitt, R. (2009). Die altpersischen Inschriften der Achaimeniden. Wiesbaden: Ludwig Reichert Verlag.
Schulze, W. (2000). Northern Talysh. Munich: Lincom.
Schwartz, M. (1970 [1971]). On the Khwarezmian version of the muqaddimat al-adab as edited by Johannes Benzing. 288–304.
Schwartz, M. (1982). “Blood” in Sogdian and Old Iranian. In Monumentum Georg Morgenstierne II, pp. 189–196. Leiden: Brill.
Schwartz, M. (2006). On Haoma, and its liturgy in the Gathas. In A. Panaino and A. Piras (Eds.), , Volume 1, Milan, pp. 215–224. Mimesis.
Schwartz, M. (2008). Iranian *l, and some Persian and Zaza etymologies. Iran and the Caucasus 12, 281–287.
Schwartz, M. (2010). On Rashnu’s scales and the Chinvant’s bridge, with etymological appendices. Studia Asiatica 11(1-2), 99–104.
Schwarzschild, L. A. (1960). Review of Pugliese Carratelli and G. Levi Della Vida. Journal of the American Oriental Society 80, 155–157.
Shringarpure, S. and E. P. Xing (2009). mStruct: inference of population structure in light of both genetic admixing and allele mutations. Genetics 182(2), 575–593.
Sims-Williams, N. (1989). Eastern Middle Iranian. In R. Schmitt (Ed.), Compendium Linguarum Iranicarum, pp. 165–172. Wiesbaden: Dr. Ludwig Reichert Verlag.
Sims-Williams, N. (1996). Eastern Iranian languages.
Skjærvø, P. O. (1983). Farnah-: mot mède en vieux-perse? Linguistique de Paris 78, 241–259.
Skjærvø, P. O. (1988). Baškardi. (8), 846–850.
Skjærvø, P. O. (1989). Pashto. In R. Schmitt (Ed.), Compendium Linguarum Iranicarum, pp. 384–410. Wiesbaden: Ludwig Reichert Verlag.
Skjærvø, P. O. (2009). Old Iranian. In G. Windfuhr (Ed.), The Iranian languages, pp. 43–195. London: Routledge.
Soane, E. B. (1913). Grammar of the Kurmanji or Kurdish language. London: Luzac.
Steingass, F. J. (1892). A Comprehensive Persian-English dictionary, including the Arabic words and phrases to be met with in Persian literature. London: Routledge & K. Paul.
Stilo, D. (1981). The Tati Language Group in the Sociolinguistic Context of Northwestern Iran and Transcaucasia. Iranian Studies 14(3/4), 137–187.
Stilo, D. (2004). Vafsi Folk Tales. Wiesbaden: Dr. Ludwig Reichert Verlag.
Stilo, D. (2005). Iranian as a buffer zone between the universal typologies of Turkic and Semitic. In E. Csató, B. Isaksson, and C. Jahani (Eds.), Linguistic Convergence and Areal Diffusion. Case Studies from Iranian, Semitic and Turkic, pp. 35–63. Routledge.
Stilo, D. (2007). Isfahan xix. Jewish dialect. (1), 77–84.
Stilo, D. (2018). Numeral classifier systems in the Araxes-Iran linguistic area. In W. B. McGregor and S. Wichmann (Eds.), The Diachrony of Classification Systems, pp. 135–164. Amsterdam: Benjamins.
Stollenwerk, D. A. (1986). Word frequency and dialect borrowing. Ohio State University Working Papers in Linguistics 34, 133–141.
Syrjänen, K., T. Honkola, J. Lehtinen, A. Leino, and O. Vesakoski (2016). Applying population genetic approaches within languages: Finnish dialects as linguistic populations. Language Dynamics and Change 6, 235–283.
Tafazzoli, A. (1974). Pahlavica II. Acta Orientalia 36, 113–23.
Tavernier, J. (2007). Iranica in the Achaemenid period (ca. 550-330 B.C.): lexicon of old Iranian proper names and loanwords. Leuven: Peeters.
Tedesco, P. (1921). Dialektologie der westiranischen Turfantexte. Le Monde Oriental 15, 184–258.
Teh, Y. W., M. I. Jordan, M. J. Beal, and D. M. Blei (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pp. 1385–1392.
Teh, Y. W., M. I. Jordan, M. J. Beal, and D. M. Blei (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association 101, 1566–1581.
Thackston, W. M. (n.d.). Kurmanji Kurdish: A reference grammar with selected readings.
Thomas, B. (1930). The Kumzari dialect of the Shihuh tribe, Arabia and a vocabulary. London: The Royal Asiatic Society.
Vahman, F. and G. Asatrian (2002). Notes on the Language and Ethnography of the Zoroastrians of Yazd, Volume 85 of Historisk-filosofiske Meddelelser. Copenhagen: The Royal Danish Academy of Sciences and Letters.
van der Wal Anonby, C. (2015). A grammar of Kumzari: A mixed Perso-Arabian language of Oman. Ph. D. thesis, Rijksuniversiteit te Leiden.
Žukovskij, V. A. (1888-1922). . St. Petersburg: [n.s.] 3 vols.
Weiss, M. (2009). Outline of the Historical and Comparative Grammar of Latin. Ann Arbor: Beech Stave Press.
Wells, J. (1982). Accents of English 1: an introduction. Cambridge: Cambridge University Press.
Wendtland, A. (2009). The position of the Pamir languages within East Iranian. Orientalia Suecana 58, 172–188.
Wieling, M., J. Nerbonne, and R. H. Baayen (2011). Quantitative social dialectology: Explaining linguistic variation geographically and socially. PLoS ONE 6(9), e23613.
Williamson, S., C. Wang, K. A. Heller, and D. M. Blei (2010). The IBP compound Dirichlet process and its application to focused topic modeling. In Proceedings of the 27th International Conference on Machine Learning. Haifa, Israel.
Windfuhr, G. (1991). Central dialects.
Windfuhr, G. (2009). Dialectology and topics. In G. Windfuhr (Ed.), The Iranian languages, Chapter 2, pp. 5–42. New York; London: Routledge.
Yarshater, E. (1969). A Grammar of Southern Tati Dialects. Number 1 in Median Dialect Studies. The Hague: Mouton.
Yarshater, E. (1962). The Tati dialects of Ramand. In W. B. Henning and E. Yarshater (Eds.), , pp. 240–245. London: Percy Lund, Humphries & Co.
Zehnder, T. (1999). Atharvaveda-Paippalāda, Buch 2, Text, Übersetzung, Kommentar: eine Sammlung altindischer Zaubersprüche vom Beginn des 1. Jahrtausends v. Chr. Idstein: Schulz-Kirchner.
Model specification and inference
The generative process for the HDP, using the truncated stick-breaking construction (Ishwaran and James 2001), is given below. The truncation cutoff T, representing the maximum number of components, is fixed in advance. S denotes the number of environments in which sound changes occur, and N the number of data points in the data set; the number of languages is likewise treated as known. At a high level, this parameterization allows the prior over components to be highly skewed, such that certain components are favored and others have prior probabilities close to zero, as justified by the data.
Within this process, one set of draws is made for every environment s and every component t, and another for each data point (i.e., sound change instance).
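One way a truncated HDP of this kind is commonly written out is sketched below. This is a reconstruction under stated assumptions and not the specification used in the paper: the symbols $\gamma$, $\alpha$, $\beta$, $\pi_l$, $\phi_{t,s}$, $\eta$, $l_i$, $s_i$, $x_i$ and $L$ are all introduced here for illustration, and the language-level draw $\pi_l \sim \mathrm{Dirichlet}(\alpha\beta)$ is the standard finite approximation, which may differ from the construction actually used.

```latex
% Assumed notation (hypothetical): L languages, S environments, T components, N data points.
\begin{align*}
\gamma,\ \alpha &\sim \mathrm{Gamma}(1, 1) \\
\beta &\sim \mathrm{GEM}(\gamma) \quad \text{(truncated at } T\text{)} \\
\pi_l &\sim \mathrm{Dirichlet}(\alpha\beta) && \text{for every language } l \in \{1,\dots,L\} \\
\phi_{t,s} &\sim \mathrm{Dirichlet}(\eta,\dots,\eta) && \text{for every environment } s \text{ and every component } t \\
z_i &\sim \mathrm{Categorical}(\pi_{l_i}) && \text{for each data point } i \in \{1,\dots,N\} \\
x_i &\sim \mathrm{Categorical}(\phi_{z_i,\,s_i})
\end{align*}
```

Here $l_i$ and $s_i$ stand for the language and the environment of data point $i$, and $x_i$ for its observed outcome.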
GEM denotes the Griffiths-Engen-McCloskey distribution, which has the following function when parameterized by
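As an illustration of the truncated stick-breaking construction, a minimal Python sketch follows. The function name `truncated_gem`, the concentration value `gamma=1.0` and the cutoff `T=10` are assumptions chosen for the example, not values taken from the paper.

```python
import random


def truncated_gem(gamma, T, rng):
    """Draw T stick-breaking weights from a GEM(gamma) prior truncated at T.

    Each break takes a Beta(1, gamma) fraction of the remaining stick;
    the final break takes whatever is left, so the weights sum to one.
    """
    weights = []
    remaining = 1.0
    for t in range(T):
        frac = 1.0 if t == T - 1 else rng.betavariate(1.0, gamma)
        weights.append(frac * remaining)
        remaining *= 1.0 - frac
    return weights


# Hypothetical settings, for illustration only.
rng = random.Random(0)
beta = truncated_gem(gamma=1.0, T=10, rng=rng)
```

Small values of `gamma` tend to concentrate mass on the first few components, while larger values spread it more evenly, which is the kind of skew the prior over components is meant to permit.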
Under this process, each data point has the following likelihood:
Marginalizing out the discrete variable z yields the following likelihood:
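A minimal sketch of this marginalization is given here, assuming the mixture form in which a data point's likelihood is a sum over components of the language's component weight times that component's probability of the observed outcome. The names `pi_l` and `phi` and the toy numbers are hypothetical.

```python
import math


def log_marginal_likelihood(pi_l, phi, s, x):
    """Return log sum_t pi_l[t] * phi[t][s][x], computed stably in log space.

    pi_l: component weights for the data point's language (all > 0).
    phi:  phi[t][s] is component t's outcome distribution in environment s.
    """
    terms = [math.log(pi_l[t]) + math.log(phi[t][s][x]) for t in range(len(pi_l))]
    m = max(terms)
    return m + math.log(sum(math.exp(v - m) for v in terms))


# Toy example: two components, one environment, two possible outcomes.
pi_l = [0.7, 0.3]
phi = [[[0.9, 0.1]], [[0.2, 0.8]]]
ll = log_marginal_likelihood(pi_l, phi, s=0, x=0)  # log(0.7*0.9 + 0.3*0.2)
```

Summing out z in this way removes the discrete variable from the model, which matters for gradient-based inference schemes that require continuous parameters.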
The posterior distributions of the model parameters can be used to reconstruct the probability that a given data point is associated with a given dialect component:
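The reconstruction of component probabilities can be sketched as a normalization of the per-component joint terms. This is an illustration under the same assumed mixture form; `pi_l`, `phi` and the toy values are hypothetical, and in practice the computation would be repeated for each posterior sample of the parameters.

```python
def component_posterior(pi_l, phi, s, x):
    """Return p(z = t | x) for every component t by normalizing pi_l[t] * phi[t][s][x]."""
    joint = [pi_l[t] * phi[t][s][x] for t in range(len(pi_l))]
    total = sum(joint)
    return [j / total for j in joint]


# Toy example: two components, one environment, two possible outcomes.
pi_l = [0.7, 0.3]
phi = [[[0.9, 0.1]], [[0.2, 0.8]]]
post = component_posterior(pi_l, phi, s=0, x=0)  # proportional to [0.63, 0.06]
```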
I place uninformative Gamma(1, 1) priors over the concentration parameters governing the distributions over components, since we do not know a priori the degree to which data points within a given language should be dispersed across components, or how many components we should expect to find. I fix the concentration parameter of the symmetric Dirichlet prior over each dialect component’s sound change distributions at .0001 to encourage sparse sound change distributions.
I carry out inference using automatic differentiation variational inference (ADVI; Kucukelbir et al. 2017) in PyMC3 (Salvatier et al. 2016). ADVI allows users to define flexible and complex differentiable Bayesian models using a wide range of prior distributions over parameters. Certain probability distributions have constrained support: e.g., every sample from the Dirichlet distribution must lie on the simplex (non-negative entries summing to one), and every sample from the Gamma distribution must be greater than zero. Parameters are mapped to unconstrained space and approximated with Gaussian variational posterior distributions, the parameters of which can be straightforwardly optimized using stochastic gradient descent. In mean-field ADVI, variational posteriors consist of independent Gaussian distributions, whereas in full-rank ADVI, variational posteriors make up a multivariate Gaussian distribution with non-diagonal covariance; I employ mean-field ADVI for simplicity. I optimize the model’s variational parameters over 4 separate initializations of 100,000 iterations each, monitoring the evidence lower bound (ELBO) for convergence. The learning rate and parameter of the Adam optimizer (Kingma and Ba 2015) are set to .01 and .8, respectively. Posterior samples for each parameter are generated by drawing 500 samples from the fitted variational posterior.
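The mapping between constrained and unconstrained space can be illustrated for a single positive parameter. The sketch below does not use PyMC3 and is not the paper's code; it assumes a log transform and a fitted Gaussian with hypothetical variational mean `mu` and standard deviation `sigma`, and draws 500 samples as described above.

```python
import math
import random


def sample_positive(mu, sigma, n, rng):
    """Draw n samples of a positive parameter from a Gaussian fitted in log space.

    The Gaussian lives in unconstrained space; exponentiating maps each draw
    back to the constrained (positive) support.
    """
    return [math.exp(rng.gauss(mu, sigma)) for _ in range(n)]


# Hypothetical fitted variational parameters, for illustration only.
rng = random.Random(1)
samples = sample_positive(mu=0.0, sigma=0.5, n=500, rng=rng)
posterior_mean = sum(samples) / len(samples)
```

Simplex-valued parameters are handled analogously, with a transform that maps unconstrained real vectors onto the simplex in place of the exponential.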
Mixture models suffer from the so-called label switching problem, in which indices of identical components differ across initializations/chains. To address this problem, I relabel the components inferred across initializations 2–4 by permuting component labels and selecting the permutation which minimizes the Kullback-Leibler divergence from the parameters for initialization 1 to the permuted parameters for the initialization under consideration. This allows us to average parameters across initializations, providing an approximation to the MAP configuration over component assignments for each item in the data set. Aggregating over these assignments produces MAP language-level distributions over component makeup.
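The relabeling step can be sketched as a brute-force search over permutations. The code below is an illustration under assumptions: each component is represented by a single probability vector, the divergence is computed as KL(reference ‖ permuted), and the names `relabel`, `run1` and `run2` are hypothetical. Exhaustive search is only feasible for a small number of components.

```python
import math
from itertools import permutations


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def relabel(reference, other):
    """Return the permutation perm minimizing sum_t KL(reference[t] || other[perm[t]])."""
    n = len(reference)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(kl(reference[t], other[perm[t]]) for t in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm


# Toy example: the second run recovers the same components in a different order.
run1 = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
run2 = [[0.5, 0.5], [0.88, 0.12], [0.25, 0.75]]
perm = relabel(run1, run2)
aligned = [run2[perm[t]] for t in range(len(run1))]

# With labels aligned, parameters can be averaged across runs.
averaged = [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(run1, aligned)]
```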
Posterior distributions over dialect components for sound change instances, averaged by type
The following table gives the posterior distribution over dialect components for every sound change instance in our data set, averaged across sound change types.
Posterior distributions over dialect components for sound change instances, raw values
https://github.com/chundrac/w_ir_layers/blob/master/p_z_all gives the raw posterior distribution over dialect components for every sound change instance in our data set.