[kaggle] Titanic: Machine Learning from Disaster

728x90

Kaggle 에서 제공하는 타이타닉 데이터를 불러와 생존자 분석 - 샘플 실습

https://www.kaggle.com/c/titanic/data

** ( ~ Sat 31 Dec 2016 )

* 필드 설명
        survival        Survival          (0 = No; 1 = Yes)
        pclass          Passenger Class   (1 = 1st; 2 = 2nd; 3 = 3rd)
        name            Name
        sex             Sex
        age             Age
        sibsp           Number of Siblings/Spouses Aboard
        parch           Number of Parents/Children Aboard
        ticket          Ticket Numbersibsp
        fare            Passenger Fare
        cabin           Cabin
        embarked        Port of Embarkation

* 작업내용

1. elasticsearch-spark 를 활용하여 elasticsearch에 데이터를 저장

- sparkConf 설정

public static SparkConf conf = 
    new SparkConf().setMaster("local").setAppName("TitanicData")
                  .set("es.index.auto.create", "true")    
                  .set("es.nodes.discovery", "true")
                  .set("es.nodes", "192.168.40.18:9200");

public static JavaSparkContext sc = new JavaSparkContext(conf);

- 파일을 불러와 Elasticsearch에 저장

SQLContext sql = new SQLContext(sc);
DataFrame data = 
        sql.read().format("com.databricks.spark.csv")
                  .option("header", "true")
                  //.option("charset", "UTF8")
                  .option("inferSchema", "true")
                  .load("C:\\data\\titanic\\train_total.csv");

EsSparkSQL.saveToEs(data, "titanic/train");

2. elasticsearch에서 데이터 로드

SQLContext sql = new SQLContext(sc);     
DataFrame data sql.read().format("org.elasticsearch.spark.sql")
                 .option("header", "true")
                 //.option("charset", "UTF8")
                 .option("inferSchema", "true")         // schema 자동 추론
                 .load("titanic/train");

3. 데이터 중 "Age"필드에 빈값을 대표값(평균)으로 셋팅

Double avgAge = Double.parseDouble(
	String.format("%.0f",  data.select(functions.avg("Age")).first().getDouble(0)));
    
String[] fill_col = {"Age"};
data = data.na().fill(avgAge.doubleValue(), fill_col);

4. 필요없는 데이터 삭제

데이터 타입을 double형태로 변환 ( features는 반드시 dobule값 )

data = data.drop("Name")
           .drop("Cabin")
           .drop("Ticket")
           .drop("Embarked")
           .withColumn("Survived", data.select("Survived").col("Survived").cast("double")   );

5. 성별 값을 코드값으로 변환

StringIndexer sexIndexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex");

6. 이용금액을 카테고리로 분류

double[] fareSplit = {
    0.0
    , 50.0
    , 100.0
    , 150.0
    , 200.0
    , Double.POSITIVE_INFINITY};

Bucketizer fareBucket
    = new Bucketizer().setInputCol("Fare").setOutputCol("FareBucket").setSplits(fareSplit);

7. 알고리즘 적용 시 사용할 필드를 선택

String[] assembly_col = {"Age", "SexIndex", "FareBucket", "SibSp", "Parch", "Pclass"};

VectorAssembler assembly = new VectorAssembler().setInputCols(assembly_col)
                                                .setOutputCol("tempResult");

8. 범위 형태의 값을 동일한 기준값으로 변환 (표준화)

Normalizer normalizer = new Normalizer().setInputCol("tempResult").setOutputCol("features");

9. 알고리즘 생성 ( LogisticRegression 적용 )

LogisticRegression logreg = new LogisticRegression().setMaxIter(10);
logreg.setLabelCol("Survived");

10. 파이프라인 생성

Pipeline pipeline = new Pipeline().setStages(
    new PipelineStage[] {fareBucket, sexIndexer, assembly, normalizer, logreg});

11. 파이프라인 모델 생성 후 결과 추출

PipelineModel model = pipeline.fit(data);
DataFrame result = model.transform(data);

12. 결과 평가

inaryClassificationMetrics metrics = 
    new BinaryClassificationMetrics(result.select("prediction", "Survived"));

System.out.println(metrics.areaUnderROC());

728x90

'공부 > Apache spark' 카테고리의 다른 글

[Spark] Spark Cluster 운영하기 (0)	2021.04.12
[Spark] 스파크 설치 & 기본실행 (0)	2021.04.12

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

공부하자

[kaggle] Titanic: Machine Learning from Disaster

'공부 > Apache spark' 카테고리의 다른 글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역

[kaggle] Titanic: Machine Learning from Disaster

'공부 > Apache spark' 카테고리의 다른 글

'공부/Apache spark' Related Articles

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역