Package 'catdata'

Title: Categorical Data
Description: This R-package contains examples from the book "Regression for Categorical Data", Tutz 2012, Cambridge University Press. The names of the examples refer to the chapter and the data set that is used.
Authors: Gunther Schauberger, Gerhard Tutz
Maintainer: Gunther Schauberger <[email protected]>
License: GPL-2
Version: 1.2.4
Built: 2025-01-30 05:55:40 UTC
Source: https://github.com/schaubert/catdata


Categorical Data

Description

This R-package contains examples from the book

Tutz (2012): Regression for Categorical Data, Cambridge University Press

The names of the examples refer to the chapter and the data set that is used.

The data sets are

addiction,
aids,
birth,
children,
deathpenalty,
dust,
encephalitis,
foodstamp,
insolvency,
knee,
leukoplakia,
medcare,
reader,
recovery,
rent,
retinopathy,
teratology,
teratology2,
unemployment,
vaso.

The chapters are abbreviated in the following way

intro Chapter 1 Introduction
binary Chapter 2 Binary Regression: The Logit Model
glm Chapter 3 Generalized Linear Models
modbin Chapter 4 Modeling of Binary Data
altbin Chapter 5 Alternative Binary Regression Models
regsel Chapter 6 Regularization and Variable Selection for Parametric Models (vignettes were removed)
count Chapter 7 Regression Analysis of Count Data
multinomial Chapter 8 Multinomial Response Models
ordinal Chapter 9 Ordinal Response Models
semiparametric Chapter 10 Semi- and Nonparametric Generalized Regression
tree Chapter 11 Tree-Based Methods
loglinear Chapter 12 The Analysis of Contingency Tables
multivariate Chapter 13 Multivariate Response Models
random Chapter 14 Random Effects and Finite Mixtures
prediction Chapter 15 Prediction and Classification

The examples are abbreviated by chaptername-dataset. Thus, for example,

modbin-dust

refers to Chapter 4 (Modeling of Binary Data) and the data set dust.
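
The naming scheme can be sketched in R as follows. The vector `chapters` and the helper `example_name()` are illustrative names that are not part of catdata; only `vignette()` is a standard R function. Treat this as a minimal sketch rather than the package's own lookup mechanism.

```r
# Chapter abbreviations as listed above (abbreviation -> chapter number).
chapters <- c(intro = 1, binary = 2, glm = 3, modbin = 4, altbin = 5,
              regsel = 6, count = 7, multinomial = 8, ordinal = 9,
              semiparametric = 10, tree = 11, loglinear = 12,
              multivariate = 13, random = 14, prediction = 15)

# Hypothetical helper: build an example name of the form chaptername-dataset.
example_name <- function(chapter, dataset) {
  stopifnot(chapter %in% names(chapters))
  paste(chapter, dataset, sep = "-")
}

nm <- example_name("modbin", "dust")
nm                    # "modbin-dust"
chapters[["modbin"]]  # 4

# Open the vignette only in an interactive session with catdata installed.
if (interactive() && requireNamespace("catdata", quietly = TRUE)) {
  vignette(nm, package = "catdata")
}
```

Some example names in the overview carry a trailing number (for instance ordinal-knee1), so the data set part may need a suffix.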

Overview of examples:

  • Chapter 2:

    • binary-vaso: Example 2.2

    • binary-unemployment: Example 2.3

  • Chapter 4:

    • modbin-unemployment: Example 4.3

    • modbin-foodstamp: Example 4.4

    • modbin-dust: Example 4.7

  • Chapter 5:

    • altbin-teratology: Example 5.1

  • Chapter 7:

    • count-children: Example 7.3

    • count-encephalitis: Example 7.4

    • count-insolvency: Example 7.5

    • count-medcare: Example 7.6

  • Chapter 8:

    • multinomial-party1: Example 8.3

    • multinomial-party2: Example 8.3

    • multinomial-travel: Example 8.4

    • multinomial-addiction1: Example 8.5

    • multinomial-addiction2: Example 8.6

  • Chapter 9:

    • ordinal-knee1: Example 9.3

    • ordinal-knee2: Example 9.4

    • ordinal-retinopathy1: Example 9.5

    • ordinal-retinopathy2: Example 9.6

    • ordinal-arthritis: Example 9.8

  • Chapter 10:

    • semiparametric-unemployment: Example 10.2

    • semiparametric-dust: Example 10.3

    • semiparametric-children: Example 10.4

    • semiparametric-addiction: Example 10.5

  • Chapter 11:

    • tree-unemployment: Example 11.1

    • tree-dust: Example 11.2

  • Chapter 12:

    • loglinear-birth: Example 12.3

    • loglinear-leukoplakia: Example 12.5

  • Chapter 13:

    • multivariate-birth1: Example 13.3

    • multivariate-knee: Example 13.4

    • multivariate-birth2: Example 13.5

  • Chapter 14:

    • random-knee1: Example 14.3

    • random-knee2: Example 14.4

    • random-aids: Example 14.6

    • random-betablocker: Example 14.7

    • random-knee3: Example 14.8

  • Chapter 15:

    • prediction-glass: Example 15.4 (vignette was removed)

    • prediction-medcare: Example 15.8

Author(s)

Gerhard Tutz and Gunther Schauberger with contributions from Sarah Maierhofer and Marcus Groß

Maintainer:
Gunther Schauberger <[email protected]>
Gerhard Tutz <[email protected]>

References

Gerhard Tutz (2012), Regression for Categorical Data, Cambridge University Press

Examples

## Not run: 
if(interactive()){vignette("modbin-dust")}

## End(Not run)

Are Addicted Persons Weak-Willed, Diseased or Both?

Description

The addiction data stems from a survey comprising 712 respondents.

Usage

data(addiction)

Format

A data frame with 712 observations on the following 4 variables.

ill

are addicted persons weak-willed (0), diseased (1) or both (2)

gender

male = 0, female = 1

age

age of surveyed person

university

surveyed person is an academic (1) or not (0)

Source

Data Archive Department of Statistics, LMU Munich

Examples

## Not run: 
##look for:
if(interactive()){vignette("semiparametric-addiction")}
if(interactive()){vignette("multinomial-addiction1")}
if(interactive()){vignette("multinomial-addiction2")}


## End(Not run)

AIDS

Description

The aids data come from a survey of 369 men who were infected with HIV.

Usage

data(aids)

Format

A data frame with 2376 observations on the following 8 variables.

cd4

number of CD4 cells

time

years since seroconversion

drugs

recreational drug use (yes=1/no=0)

partners

number of sexual partners

packs

packs of cigarettes a day

cesd

a mental illness score

age

Age centered around 30

person

Identification number

Source

Multicenter AIDS Cohort Study (MACS), see Zeger and Diggle (1994), Semi-parametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters, Biometrics, 50, 689–699.

Examples

## Not run: 
##look for:
if(interactive()){vignette("random-aids")}

## End(Not run)

Birth

Description

The birth data contain information about the birth and pregnancy of 775 children who were born alive between 1990 and 2004. The data were collected from internet users recruited on French-speaking pregnancy and birth websites.

Usage

data(birth)

Format

A data frame with 775 observations on the following 25 variables.

IndexMother

ID variable

Sex

Sex of child: male = 1, female = 2

Weight

Weight of child at birth in grams

Height

Height of child at birth in centimeters

Head

Head circumference of child at birth in centimeters

Month

Month of birth from 1 to 12

Year

Year of birth

Country

Country of birth: France (FR), Belgium (BE), Switzerland (CH), Canada (CA), Great Britain (GB), Germany (DE), Spain (ES), United States (US)

Term

Term of pregnancy in weeks from the last menstruation

AgeMother

Age of mother on the day of birth

Previous

Number of pregnancies before

WeightBefore

Weight of mother before the pregnancy

HeightMother

Height of mother in centimeters

WeightEnd

Weight of mother after the pregnancy

Twins

Was the pregnancy a multiple birth? no = 0, yes = 1

Intensive

Days that child spent in intensive care unit

Cesarean

Has the child been born by cesarean section? no = 0, yes = 1

Planned

Has the cesarean been planned? no = 0, yes = 1

Episiotomy

Has an episiotomy been made? no = 0, yes = 1

Tear

Did a perineal tear appear? no = 0, yes = 1

Operative

Has an operative aid like delivery forceps or vacuum been used? no = 0, yes = 1

Induced

Has the birth been induced artificially? no = 0, yes = 1

Membranes

Did the membranes rupture before the onset of labor? no = 0, yes = 1

Rest

Has a strict bed rest been ordered to the mother for at least one month during the pregnancy? no = 0, yes = 1

Presentation

Presentation of the child before birth: cephalic presentation = 1, pelvic presentation = 2, other presentation (e.g. transverse) = 3

Source

see Boulesteix (2006), Maximally selected chi-squared statistics for ordinal variables, Biometrical Journal, 48, 451–462.

Examples

## Not run: 
##look for:
if(interactive()){vignette("loglinear-birth")}
if(interactive()){vignette("multivariate-birth1")}
if(interactive()){vignette("multivariate-birth2")}

## End(Not run)

Number of Children

Description

The children data contains the information about the number of children of women.

Usage

data(children)

Format

A data frame with 3548 observations on the following 6 variables.

child

number of children

age

age of woman in years

dur

years of education

nation

nationality of the woman: 0 = German, 1 = otherwise

god

Believing in god: 1 = Strong agreement, 2 = Agreement, 3 = No definite opinion, 4 = Rather no agreement, 5 = No agreement at all, 6 = Never thought about it

univ

attended university: 0 = no, 1 = yes

Source

German General Social Survey (ALLBUS)

Examples

## Not run: 
##example of analysis:
if(interactive()){vignette("count-children")}
if(interactive()){vignette("semiparametric-children")}

## End(Not run)

Death-Penalty

Description

The deathpenalty data concern the judgment of defendants in cases of multiple murders in Florida between 1976 and 1987. The cases are classified with respect to death penalty, race of defendant and race of victim.

Usage

data(deathpenalty)

Format

A data frame with 8 observations on the following 4 variables. Considering the weighting variable "Freq", there are 674 cases.

DeathPenalty

Was the sentence the death penalty? yes = 1, no = 0

VictimRace

The race of the victim: white = 1, black = 0

DefendantRace

The race of the defendant: white = 1, black = 0

Freq

Frequency of observation

Source

Agresti, A. (2002) Categorical Data Analysis. Wiley

References

Agresti, A. (2002) Categorical Data Analysis. Wiley

Examples

## Not run: 
##look for:
data(deathpenalty)

## End(Not run)
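
Because the data are stored as one row per cell with a `Freq` column, they usually have to be tabulated or expanded before analysis. The following is a minimal sketch on a toy data frame with the same column layout; the counts are invented for illustration and are not the values in deathpenalty.

```r
# Toy stand-in with the same layout as deathpenalty; the counts are made up.
toy <- expand.grid(DeathPenalty = 0:1, VictimRace = 0:1, DefendantRace = 0:1)
toy$Freq <- c(20, 3, 40, 9, 15, 1, 60, 12)

# Three-way contingency table of the cases.
tab <- xtabs(Freq ~ DeathPenalty + DefendantRace + VictimRace, data = toy)
ftable(tab)

# Number of cases represented by the 8 rows.
n_cases <- sum(toy$Freq)

# One row per case, if a case-level data frame is needed.
cases <- toy[rep(seq_len(nrow(toy)), times = toy$Freq),
             c("DeathPenalty", "VictimRace", "DefendantRace")]
nrow(cases) == n_cases
```

Replacing `toy` by the data frame loaded with data(deathpenalty) should give the 674 cases mentioned in the Format section.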

Chronic Bronchial Reaction to Dust

Description

The dust data was surveyed among the employees of a Munich factory.

Usage

data(dust)

Format

A data frame with 1246 observations on the following 4 variables.

bronch

chronic bronchial reaction, no = 0, yes = 1

dust

dust concentration (mg/m^3) at the workplace

smoke

Is the employee a smoker? no = 1, yes = 2

years

years of dust exposure

Source

Data Archive Department of Statistics, LMU Munich

Examples

## Not run: 
##example of analysis:
if(interactive()){vignette("modbin-dust")}
if(interactive()){vignette("semiparametric-dust")}
if(interactive()){vignette("tree-dust")}

## End(Not run)

Cases of Herpes Encephalitis in Bavaria and Lower Saxony

Description

The encephalitis data are based on a study on the occurrence of herpes encephalitis in children. Cases were observed in Bavaria and Lower Saxony between 1980 and 1993.

Usage

data(encephalitis)

Format

A data frame with 26 observations containing the following variables

year

years 1980 to 1993 (1 – 14)

country

Bavaria = 1, Lower Saxony = 2

count

number of cases with herpes encephalitis

References

Karimi, A., Windorfer, A., Dreesemann, J. (1980) Vorkommen von zentralvenösen Infektionen in europäischen Ländern. Technical report, Schriften des Niedersächsischen Landesgesundheitsamtes.

Examples

## Not run: 
##look for:
if(interactive()){vignette("count-encephalitis")}

## End(Not run)

Food-Stamp Program

Description

The foodstamp data stem from a survey on the federal food-stamp program in which 150 persons were interviewed. The response indicates participation.

Usage

data(foodstamp)

Format

A data frame with 150 observations on the following 4 variables.

y

participation in federal food-stamp program, yes = 1, no = 0

TEN

tenancy, yes = 1, no = 0

SUP

supplemental income, yes = 1, no = 0

INC

log-transformed monthly income, log(monthly income + 1)

References

Künsch, H. R., Stefanski, L. A., Carroll, R. J. (1989) Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association 84, 460–466.

Examples

## Not run: 
##look for:
if(interactive()){vignette("modbin-foodstamp")}

## End(Not run)
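
The transformation used for `INC` can be undone when results are to be reported on the original income scale. The sketch below assumes the natural logarithm and uses invented income values, not values from foodstamp.

```r
# Hypothetical monthly incomes, for illustration only.
income <- c(0, 250, 1200)

INC <- log(income + 1)        # transformation described for INC
income_back <- exp(INC) - 1   # back-transformation to the income scale

all.equal(income_back, income)

# log1p() and expm1() compute the same quantities with better accuracy near 0.
all.equal(log1p(income), INC)
all.equal(expm1(INC), income)
```

Adding 1 before taking the logarithm keeps an income of 0 finite, since log(0 + 1) = 0.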

Glass Identification

Description

A data set from the USA Forensic Science Service that distinguishes between six types of glass. (Four types of window glass and three types of non-window glass are defined, but one window type does not occur in the data.) Predictors are the refractive index and the oxide content of various minerals.

Usage

data(glass)

Format

A data frame with 214 observations on the following 10 variables.

RI

Refractive index

Na

Oxide content of sodium

Mg

Oxide content of magnesium

Al

Oxide content of aluminium

Si

Oxide content of silicon

K

Oxide content of potassium

Ca

Oxide content of calcium

Ba

Oxide content of barium

Fe

Oxide content of iron

type

Type of glass

Source

http://archive.ics.uci.edu/ml/datasets/Glass+Identification

References

Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press.

Examples

## Not run: 
##example of analysis:
if(interactive()){vignette("prediction-glass")}

## End(Not run)

Heart Disease

Description

A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa.

Usage

data(heart)

Format

A data frame with 462 observations on the following 10 variables.

y

coronary heart disease (yes = 1, no = 0)

sbp

systolic blood pressure

tobacco

cumulative tobacco

ldl

low density lipoprotein cholesterol

adiposity

adiposity

famhist

family history of heart disease

typea

type-A behavior

obesity

obesity

alcohol

current alcohol consumption

age

age at onset

References

South African Heart Disease dataset
Hastie, T., Tibshirani, R., and Friedman, J. (2001):
Elements of Statistical Learning; Data Mining, Inference, and Prediction, Springer-Verlag, New York

Examples

##example of analysis:
if(interactive()){vignette("regsel-heartdisease1")}
if(interactive()){vignette("regsel-heartdisease2")}
if(interactive()){vignette("regsel-heartdisease3")}
if(interactive()){vignette("regsel-heartdisease4")}
if(interactive()){vignette("regsel-heartdisease5")}
if(interactive()){vignette("regsel-heartdisease6")}

Insolvency of Companies in Berlin

Description

The insolvency data gives the number of insolvent companies per month in Berlin from 1994 to 1996.

Usage

data(insolvency)

Format

A data frame with 36 observations on the following 4 variables.

insolv

number of insolvent companies

year

years 1994–1996 (1–3)

month

month (1–12)

case

running number of the month (1–36)

Examples

## Not run: 
##example of analysis:
if(interactive()){vignette("count-insolvency")}

## End(Not run)
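
The three time variables are linked: with 3 years of 12 months each, a running index from 1 to 36 follows from `year` and `month`. The sketch below rebuilds that structure from scratch, assuming the rows are ordered by year and then by month; it does not use the actual insolvency counts, and `calendar_year` is an illustrative name.

```r
# Assumed layout: 3 years x 12 months, ordered by year and then month.
ins <- expand.grid(month = 1:12, year = 1:3)

# Running index over all months (1, ..., 36).
ins$case <- (ins$year - 1L) * 12L + ins$month

# Map the coded year back to the calendar year (1 -> 1994, ..., 3 -> 1996).
ins$calendar_year <- 1993L + ins$year

head(ins, 3)
range(ins$case)
```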

Knee Injuries

Description

In a clinical study n=127 patients with sport-related injuries were treated with two different therapies (assigned at random). After 3, 7 and 10 days of treatment the pain occurring during knee movement was recorded.

Usage

data(knee)

Format

A data frame with 127 observations on the following 8 variables.

N

Patient's number

Th

Therapy (placebo = 1, treatment = 2)

Age

Age in years

Sex

Gender (male = 0, female = 1)

R1

Pain before treatment (no pain = 1, severe pain = 5)

R2

Pain after three days of treatment

R3

Pain after seven days of treatment

R4

Pain after ten days of treatment

Examples

##example of analysis:
if(interactive()){vignette("ordinal-knee1")}
if(interactive()){vignette("ordinal-knee2")}
if(interactive()){vignette("multivariate-knee")}
if(interactive()){vignette("random-knee1")}
if(interactive()){vignette("random-knee3")}
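
The data are stored in wide format, with one row per patient and the four pain ratings in `R1` to `R4`. Models for repeated measurements usually need one row per patient and occasion. A minimal sketch of that reshaping on invented toy data is given below; the column names `pain` and `day`, and the coding of the pre-treatment rating as day 0, are assumptions made for illustration.

```r
# Toy data in the layout of knee; all values are invented.
toy <- data.frame(N = 1:3, Th = c(1, 2, 2), Age = c(28, 35, 41),
                  Sex = c(0, 1, 0),
                  R1 = c(4, 5, 3), R2 = c(4, 4, 3),
                  R3 = c(3, 3, 2), R4 = c(3, 2, 1))

# Wide -> long: one row per patient and measurement occasion.
long <- reshape(toy, direction = "long",
                varying = c("R1", "R2", "R3", "R4"),
                v.names = "pain",
                timevar = "day",
                times = c(0, 3, 7, 10),
                idvar = "N")

# Sort by patient and day and drop the generated row names.
long <- long[order(long$N, long$day), ]
rownames(long) <- NULL
long
```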

Knee Injuries

Description

In a clinical study n=127 patients with sport-related injuries were treated with two different therapies (assigned at random). After 3, 7 and 10 days of treatment the pain occurring during knee movement was recorded. The data set is a transformed version of knee for fitting a cumulative logit model.

Usage

data(kneecumulative)

Format

A data frame with 127 observations on the following 6 variables.

y

Response

Th

Therapy (placebo = 1, treatment = 2)

Age

Age in years

Age2

Squared age

Sex

Gender (male = 0, female = 1)

Person

Person

Examples

##example of analysis:
if(interactive()){vignette("random-knee2")}

Knee Injuries

Description

In a clinical study n=127 patients with sport-related injuries were treated with two different therapies (assigned at random). After 3, 7 and 10 days of treatment the pain occurring during knee movement was recorded. The data set is a transformed version of knee for fitting a sequential logit model.

Usage

data(kneesequential)

Format

A data frame with 127 observations on the following 10 variables.

y

Response

Icept1

Intercept 1

Icept2

Intercept 2

Icept3

Intercept 3

Icept4

Intercept 4

Th

Therapy (placebo = 1, treatment = 2)

Age

Age in years

Age2

Squared age

Sex

Gender (male = 0, female = 1)

Person

Person

Examples

##example of analysis:
if(interactive()){vignette("random-knee2")}

Leukoplakia

Description

The leukoplakia data concern the occurrence of oral leukoplakia with the covariates smoking and alcohol consumption.

Usage

data(leukoplakia)

Format

A data frame with 16 observations on the following 4 variables. Considering the weighting variable "Freq", there are 212 cases.

Leukoplakia

Does the person have oral leukoplakia? yes = 1, no = 0

Alcohol

How much alcohol did the person drink on average? no = 1, less than 40g = 2, less than 80g = 3, more than 80g = 4

Smoker

Smoker? yes = 1, no = 0

Freq

Frequency of observation

Source

Fahrmeir, Hamerle and Tutz (1996), Multivariate statistische Verfahren, Berlin: de Gruyter

Examples

## Not run: 
##look for:
if(interactive()){vignette("loglinear-leukoplakia")}

## End(Not run)

Number of Physician Office Visits

Description

The medcare data were collected on 4406 individuals, aged 66 and over, who were covered by Medicare, a public insurance program.

Usage

data(medcare)

Format

A data frame with 4406 observations on the following 9 variables.

ofp

number of physician office visits

hosp

number of hospital stays

healthpoor

individual has poor health (reference: average health)

healthexcellent

individual has excellent health

numchron

number of chronic conditions

male

female = 0, male = 1

age

age of individual (centered around 60)

married

married = 1, else = 0

school

years of education

Source

https://www.econ.queensu.ca

References

US National Medical Expenditure Survey in 1987/88

Examples

## Not run: 
##example of analysis:
if(interactive()){vignette("count-medcare")}
if(interactive()){vignette("prediction-medcare")}

## End(Not run)

Who is a Regular Reader?

Description

The reader data contain information on the reading behaviour of women with respect to a specific women's journal.

Usage

data(reader)

Format

A data frame with 48 observations on the following 5 variables. Considering the weighting variable "Freq", there are 941 observations.

RegularReader

Is the woman a regular reader? yes = 1, no = 0

Working

Is the woman working? yes = 1, no = 0

Age

Age of the woman in categories (18–29 years = 1, 30–39 = 2, 40–49 = 3)

Education

Level of education. L1 = 11, L2 = 12, L3 = 13, L4 = 14

Freq

Frequency of the observation

Source

Fahrmeir, Hamerle and Tutz (1996), Multivariate statistische Verfahren, Berlin: de Gruyter
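
Since each row stands for `Freq` women, a binary regression on these data has to take the frequencies into account. One common way is to pass `Freq` as weights to `glm()`, which gives the same coefficient estimates as fitting the model to one row per woman. The sketch below illustrates this on a toy data frame with invented frequencies and only three of the columns; it is not the analysis from the book.

```r
# Toy data in the layout of reader; the frequencies are made up.
toy <- expand.grid(RegularReader = 0:1, Working = 0:1, Age = 1:3)
toy$Freq <- c(30, 10, 25, 20, 28, 14, 22, 26, 35, 12, 18, 24)

# Logit model on the aggregated rows, weighted by frequency.
fit_w <- glm(RegularReader ~ Working + factor(Age), family = binomial(),
             weights = Freq, data = toy)

# Same model fitted to one row per woman.
long <- toy[rep(seq_len(nrow(toy)), times = toy$Freq), ]
fit_long <- glm(RegularReader ~ Working + factor(Age), family = binomial(),
                data = long)

# The two sets of coefficient estimates agree up to numerical tolerance.
cbind(weighted = coef(fit_w), expanded = coef(fit_long))
```

The reported residual degrees of freedom differ between the two fits, because they are based on the number of rows rather than on the number of women.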


Post-Surgery Recovery of Children

Description

The recovery data contains information on 60 children after a surgery.

Usage

data(recovery)

Format

A data frame with 240 observations on the following 11 variables.

y

recovery score

Dos1

Dosage=15 (yes = 1, no = 0)

Dos2

Dosage=20 (yes = 1, no = 0)

Dos3

Dosage=25 (yes = 1, no = 0)

Age

Age of child (in months)

Age2

Squared age

Dur

Duration of surgery (in minutes)

Rep1

First repetition (yes = 1, no = 0)

Rep2

Second repetition (yes = 1, no = 0)

Rep3

Third repetition (yes = 1, no = 0)

Person

ID-Variable for each child (1–60)

Details

In a randomized study 60 children undergoing surgery were treated with one of four dosages of an anaesthetic (15, 20, 25, 30). Upon admission to the recovery room and at minutes 5, 15 and 30 following admission, recovery scores were assigned on a categorical scale ranging from 1 (least favourable) to 6 (most favourable). Therefore one has four repetitions of a variable with 6 categories. The aim is to model how recovery scores depend on covariates such as dosage of the anaesthetic (four levels), duration of surgery (in minutes) and age of the child (in months).

References

Davis, C.S. (1991) Semi-parametric and Non-parametric Methods for the Analysis of Repeated Measurements with Applications to Clinical Trials. Statistics in Medicine 10, 1959–1980
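
The four dosage levels are stored as three dummy variables. A single dosage factor can be rebuilt from them, as sketched below on invented toy rows. The sketch assumes that a row with `Dos1`, `Dos2` and `Dos3` all equal to 0 belongs to dosage 30, which the documentation implies but does not state.

```r
# Toy rows in the layout of the dosage dummies; values are invented.
toy <- data.frame(Dos1 = c(1, 0, 0, 0),
                  Dos2 = c(0, 1, 0, 0),
                  Dos3 = c(0, 0, 1, 0))

# Assumption: all three dummies equal to 0 means dosage 30 (reference level).
dose <- with(toy, 15 * Dos1 + 20 * Dos2 + 25 * Dos3 +
                  30 * (1 - Dos1 - Dos2 - Dos3))

# Store the result as a factor with the four dosage levels.
toy$dosage <- factor(dose, levels = c(15, 20, 25, 30))
toy
```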


Rent in Munich

Description

The rent data contains the rent index for Munich in 2003.

Usage

data(rent)

Format

A data frame with 2053 observations on the following 13 variables.

rent

net rent in euros

rentm

net rent per square meter in euros

size

living space in square meters

rooms

number of rooms

year

year of construction

area

municipality

good

good address, yes = 1, no = 0

best

best address, yes = 1, no = 0

warm

warm water, yes = 0, no = 1

central

central heating, yes = 0, no = 1

tiles

bathroom with tiles, yes = 0, no = 1

bathextra

special furniture in bathroom, yes = 1, no = 0

kitchen

upmarket kitchen, yes = 1, no = 0

Source

Data Archive Department of Statistics, LMU Munich

References

Fahrmeir, L., Künstler, R., Pigeot, I., Tutz, G. (2004) Statistik: der Weg zur Datenanalyse. 5. Auflage, Berlin: Springer-Verlag.

Examples

##example of analysis:
data(rent)
summary(rent)

Retinopathy

Description

The retinopathy data contains information on persons with retinopathy.

Usage

data(retinopathy)

Format

A data frame with 613 observations on the following 5 variables.

RET

RET=1: no retinopathy, RET=2: nonproliferative retinopathy, RET=3: advanced retinopathy or blind

SM

SM=1: smoker, SM=0: non-smoker

DIAB

diabetes duration in years

GH

glycosylated hemoglobin measured in percent

BP

diastolic blood pressure in mmHg

References

Bender and Grouven (1998), Using binary logistic regression models for ordinal data with non-proportional odds, J. Clin. Epidemiol., 51, 809–816.

Examples

## Not run: 
## look for
if(interactive()){vignette("ordinal-retinopathy1")}
if(interactive()){vignette("ordinal-retinopathy2")}
 
## End(Not run)

Teratology

Description

In a teratology experiment 58 rats on iron-deficient diets were assigned to four groups. In the first group only placebo injections were given, in the other groups iron supplements were given. The animals were made pregnant and sacrificed after three weeks. The response is the number of living and dead rats of a litter.

Usage

data(teratology)

Format

A data frame with 58 observations on the following 3 variables.

D

number of dead fetuses in the rat's litter

L

number of surviving fetuses in the rat's litter

Grp

group (Untreated = 1, Injections days 7 and 10 = 2, Injections days 0 and 7 = 3, Injections weekly = 4)

References

Moore, D. F. and Tsiatis, A. (1991) Robust estimation of the variance in moment methods for extra-binomial and extra-Poisson variation. Biometrics 47, 383–401.

Examples

data(teratology)
summary(teratology)
## Not run: 
if(interactive()){vignette("altbin-teratology")}

## End(Not run)
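
With the number of dead (`D`) and surviving (`L`) fetuses per litter, the response can be passed to `glm()` as a two-column matrix. The sketch below fits a grouped binomial logit model and a quasi-binomial variant, one common way to allow for extra-binomial variation between litters. It uses invented toy counts and is not necessarily the model used in the vignette.

```r
# Toy litters in the layout of teratology; the counts are invented.
toy <- data.frame(D = c(3, 5, 1, 0, 2, 1, 0, 1),
                  L = c(7, 5, 9, 10, 8, 11, 12, 9),
                  Grp = c(1, 1, 2, 2, 3, 3, 4, 4))

# Grouped binomial logit model: response is (deaths, survivors) per litter.
fit <- glm(cbind(D, L) ~ factor(Grp), family = binomial(), data = toy)

# Quasi-binomial variant: same coefficients, plus an estimated dispersion.
fit_q <- glm(cbind(D, L) ~ factor(Grp), family = quasibinomial(), data = toy)

coef(fit)
summary(fit_q)$dispersion

# Fitted death probability in group 1 equals the pooled proportion 8/20.
plogis(coef(fit)[["(Intercept)"]])
```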

Teratology2

Description

In a teratology experiment 58 rats on iron-deficient diets were assigned to four groups. In the first group only placebo injections were given, in the other groups iron supplements were given. The animals were made pregnant and sacrificed after three weeks. The response was whether the fetus was dead (yij = 1) for each fetus in each rat's litter.

Usage

data(teratology2)

Format

A data frame with 607 observations on the following 3 variables.

y

dead = 1, living = 0

Rat

Number of animal

Grp

treatment group

References

Moore, D. F. and Tsiatis, A. (1991) Robust estimation of the variance in moment methods for extra-binomial and extra-Poisson variation. Biometrics 47, 383–401.

Examples

## Not run: 
data(teratology2)
if(interactive()){vignette("altbin-teratology")}

## End(Not run)
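
The fetus-level layout can be obtained from litter-level counts by repeating each litter once per fetus. The following sketch shows the idea on invented toy counts; it assumes that `Rat` simply numbers the litters and is not the code that was used to create teratology2.

```r
# Toy litter-level counts (D = dead, L = living); values are invented.
toy <- data.frame(D = c(2, 0, 1), L = c(3, 4, 2), Grp = c(1, 2, 4))
n <- toy$D + toy$L   # litter sizes

# One row per fetus: y = 1 for dead, y = 0 for living.
fetus <- data.frame(
  y   = unlist(Map(function(d, l) rep(c(1, 0), times = c(d, l)),
                   toy$D, toy$L)),
  Rat = rep(seq_len(nrow(toy)), times = n),
  Grp = rep(toy$Grp, times = n)
)

nrow(fetus)                        # total number of fetuses
tapply(fetus$y, fetus$Rat, sum)    # deaths per litter, equal to toy$D
```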

Long-Term/Short-Term Unemployment

Description

The unemployment data contains information on 982 unemployed persons.

Usage

data(unemployment)

Format

A data frame with 982 observations on the following 2 variables.

age

age of the person in years (from 16 to 61)

durbin

short-term (1) or long-term (2) unemployment

Source

Socio-economic panel 1995

Examples

## Not run: 
##look for:
if(interactive()){vignette("binary-unemployment")}
if(interactive()){vignette("modbin-unemployment1")}
if(interactive()){vignette("modbin-unemployment2")}
if(interactive()){vignette("semiparametric-unemployment")}
if(interactive()){vignette("tree-unemployment")}

## End(Not run)
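
Because `durbin` is coded 1/2, it has to be turned into a 0/1 indicator (or a factor) before it can serve as the response of a binary regression model. The sketch below does this on invented toy data and fits a simple logit model with `age` as the only predictor. Treating short-term unemployment as the modelled event, and the name `short`, are assumptions made for illustration.

```r
# Toy data in the layout of unemployment; values are invented.
toy <- data.frame(age    = c(18, 25, 33, 41, 52, 60, 29, 47),
                  durbin = c(1, 1, 2, 1, 2, 2, 1, 2))

# Indicator for short-term unemployment (assumed event of interest).
toy$short <- as.integer(toy$durbin == 1)

# Logit model for the probability of short-term unemployment.
fit <- glm(short ~ age, family = binomial(), data = toy)
coef(fit)
```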

Vasoconstriction and Breathing

Description

The vaso data contain binary data. Three test persons inhaled a certain amount of air at different rates. In some cases a vasoconstriction (neural constriction of vasculature) occurred at their skin. The goal of the study was to investigate the relationship between breathing and vasoconstriction. The test persons repeated the test 9, 8 and 22 times, so the data frame has 39 observations.

Usage

data(vaso)

Format

A data frame with 39 observations on the following 3 variables.

vol

amount of air

rate

rate of breathing

vaso

condition of vasculature: no vasoconstriction = 1, vasoconstriction = 2

Source

Data Archive Department of Statistics, LMU Munich

References

Finney, D. J. (1971) Probit Analysis. 3rd edition. Cambridge University Press.

Pregibon, D. (1982) Resistant fits for some commonly used logistic models. Appl. Stat. 29, 15–24.

Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.

Examples

## Not run: 
##look for:
if(interactive()){vignette("binary-vaso")}

## End(Not run)