random phrase generation and gender/human agreement in English

I am trying to generate random practice phrases in English for a Morse code trainer, and I am trying to figure out how to deal with gender agreement. I'd like to be able to generate phrases like "He is a son", "She is a mother", "It is a door", but avoid things like "He is a mother", "She is a door", "It is a father". "He is a mother" mixes genders, and sentences like "She is a door" and "It is a father" mix human/non-human. It seems that in the RGL, human and nonhuman are both values of the Gender type.

There are times when that sort of thing is acceptable, such as the phrase "No man is an island". And, for some reason, gender reveal parties often use phrases like "It's a boy!". But I'm just trying to generate training data, so I am trying to focus on common usage.

I am very new to Grammatical Framework, so I could be approaching this entirely wrong. Here is what I have so far:

In Agreement.gf

abstract Agreement = {

flags startcat = Message ;

cat
  Message ; Subject ; SubjectComplement ;
fun
  Is  : Subject -> SubjectComplement -> Message ;
  He, She, It  : Subject;
  Son, Daughter, Father, Mother, Fence, Door : SubjectComplement;
}

In AgreementEng.gf

concrete AgreementEng of Agreement = open DictEng, SyntaxEng, ParadigmsEng, VerbEng, ResEng in {
lincat
  Message  = Cl ;
  Subject  = NP;
  SubjectComplement = CN;
lin
  Is s sc = mkCl s sc;
  He  = DictEng.he_Pron;
  She = DictEng.she_Pron;
  It = DictEng.it_Pron;
  Son = mkCN son_N;
  Daughter = mkCN daughter_N;
  Mother = mkCN mother_N;
  Father = mkCN father_N;
  Fence  = mkCN fence_N;
  Door  = mkCN door_N;
}

If I load this into gf and run generate_random | linearize, it works, but ignores gender and humanness.

I see that in DictEng there are some gender/nonhuman markers for the pronouns,

lin she_Pron = mkPron "she" "her" "her" "hers" singular P3 feminine ;
lin he_Pron = mkPron "he" "him" "his" "his" singular P3 masculine ;
lin it_Pron  = mkPron "it" "it" "its" "its" singular P3 nonhuman;

Though not for most nouns,

lin mother_N = mkN "mother" "mothers";
lin daughter_N = mkN "daughter" "daughters";

Though some do have gender marked,

lin actor_N = mkN masculine (mkN "actor" "actors");
lin actress_N = mkN feminine (mkN "actress" "actresses");

How would you approach this?

I am open to suggestions for any aspects of this code -- not just the gender issue. My overall goal is to generate increasingly complex, vaguely sensible English phrases. Think Duolingo -- but for Morse code. I will have a bunch of training levels which build on top of previous levels, adding new vocabulary, longer sentences, etc.

At the moment, I do not care about non-English languages -- that is a problem for future me. I also do not need to support everything in DictEng. The list of potential words and phrases will be hand curated.

Using what is shown so far, I'd start by training on individual words, "he", "she", "it", "is", "son", etc.

Then simple phrases "he is", "she is", "it is".

Then finally full sentences like "he is a son".

Then I would add the plurals, "we", "they", "are", "sons", etc. Then I'd train the new words individually. Then phrases like "we are", "they are", etc. And then sentences "we are fathers". And then I'd do a mixture of singular and plural sentences.

So, in the grammar files I need the granularity to generate each of these different types of training phrases.

Thanks!

(Not sure it matters, but I have decades of Haskell experience and dabble in things like Idris. So I think I am fine with the Grammatical Framework language itself -- my trouble is more in understanding the libraries (RGL) and the big picture.)

There is 1 best solution below


The RGL Gender parameter only controls things like "she sees herself/he sees himself/the tree sees itself", but nothing more semantic than that. So if you want to make sure that your sentences make sense, you need to add a custom parameter.

Here's a concrete syntax that works, in that it just doesn't linearise combinations where the genders don't match.

(Btw, I replaced your Cl with S, because Cl is open for tense, polarity, mood etc., and English just happens to output present indicative in the GF shell, but you can't trust that to happen in other languages.)

concrete AgreementEng of Agreement = open DictEng, SyntaxEng, ParadigmsEng, Prelude in {
lincat
  Message  = S ;
  Subject  = {np : NP ; g : HumanGender} ;
  SubjectComplement = {cn : CN ; g : HumanGender} ;

param
  HumanGender = M | F | Inanimate ;

lin
  Is s sc = case <s.g, sc.g> of {
    <Inanimate,Inanimate>
    |<M,M>
    |<F,F> => mkS (mkCl s.np sc.cn) ;
    _      => noS
    } ;
  He  = {np = he_NP ; g = M} ;
  She = {np = she_NP ; g = F} ;
  It = {np = it_NP ; g = Inanimate} ;
  Son = mkSubjCompl son_N M ;
  Daughter = mkSubjCompl daughter_N F;
  Mother = mkSubjCompl mother_N F ;
  Father = mkSubjCompl father_N M ;
  Fence  = mkSubjCompl fence_N Inanimate ;
  Door  = mkSubjCompl door_N Inanimate ;

oper
  mkSubjCompl : N -> HumanGender -> {cn : CN ; g : HumanGender} = \n,g ->
    {cn = mkCN n ; g = g} ;

  noS : S = mkS (mkCl (mkN nonExist)) ;
}

This oper noS is made from the nonExist token, which just causes an exception and prints nothing. So when you generate all trees and linearise them, this is what you get:

Agreement> gt | l
he is a father
he is a son
it is a door
it is a fence
she is a daughter
she is a mother

But if you do gt | l -treebank, you will see that it generated many more trees (all 18 subject/complement combinations) and just didn't linearise those where the HumanGenders didn't match.

For a softer option, you can have it output the sentence (like "she is a father"), but append something at the end. Here's another approach, where the first concrete outputs everything, but you have a second concrete only for plausibility filtering: https://github.com/michmech/plausibility#readme
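The softer option mentioned above could look roughly like the sketch below. It is an untested variant of the concrete syntax in this answer, and the module name AgreementSoftEng and the "[implausible]" marker are made up for illustration. It assumes that the English Utt is a record with an s field of type Str, so that the linearised sentence can be pulled out as a plain string and a marker appended to it.

```gf
-- Save as AgreementSoftEng.gf next to Agreement.gf (hypothetical file name).
concrete AgreementSoftEng of Agreement = open DictEng, SyntaxEng in {
lincat
  Message  = {s : Str} ;   -- plain string, so a marker can be appended
  Subject  = {np : NP ; g : HumanGender} ;
  SubjectComplement = {cn : CN ; g : HumanGender} ;

param
  HumanGender = M | F | Inanimate ;

lin
  Is s sc =
    -- Always build the sentence; assumes Utt has a field s : Str.
    let sent : Str = (mkUtt (mkS (mkCl s.np sc.cn))).s in
    case <s.g, sc.g> of {
      <Inanimate,Inanimate>
      |<M,M>
      |<F,F> => {s = sent} ;
      -- Mismatch: keep the sentence but flag it (marker is hypothetical).
      _      => {s = sent ++ "[implausible]"}
    } ;
  He  = {np = he_NP ; g = M} ;
  She = {np = she_NP ; g = F} ;
  It  = {np = it_NP ; g = Inanimate} ;
  Son      = mkSubjCompl son_N M ;
  Daughter = mkSubjCompl daughter_N F ;
  Mother   = mkSubjCompl mother_N F ;
  Father   = mkSubjCompl father_N M ;
  Fence    = mkSubjCompl fence_N Inanimate ;
  Door     = mkSubjCompl door_N Inanimate ;

oper
  mkSubjCompl : N -> HumanGender -> {cn : CN ; g : HumanGender} = \n,g ->
    {cn = mkCN n ; g = g} ;
}
```

With this variant every tree should get a linearisation, so gt | l would print all 18 combinations, and whatever consumes the output can drop or keep the flagged lines.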

Finally, it might be interesting to read this blog post. It's not directly related to your question, but it provides some general philosophy how to think about things in GF.
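To make the earlier Cl versus S remark a bit more concrete, here is a small untested sketch. The module name TenseDemo and the oper names are made up; mkCl, mkS, pastTense and negativePol are the RGL API constructors as I remember them. A Cl is still open for tense and polarity, and mkS is what fixes them, either to the defaults or to explicit values.

```gf
-- Save as TenseDemo.gf (hypothetical file name).
resource TenseDemo = open SyntaxEng, DictEng in {
oper
  -- A clause: tense and polarity are not decided yet.
  demoCl : Cl = mkCl she_NP (mkCN mother_N) ;

  -- mkS with no extra arguments fixes the defaults:
  -- present tense, positive polarity.
  demoPresent : S = mkS demoCl ;

  -- The same clause, explicitly fixed to past tense and negative polarity.
  demoPastNeg : S = mkS pastTense negativePol demoCl ;
}
```

In the GF shell, importing this with i -retain TenseDemo.gf and then evaluating the opers with cc (for example cc demoPastNeg) should show the resulting records.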