๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

์ธ๊ณต์ง€๋Šฅ

[Day 16] NLP - Bag-of-Words & Word Embedding

๐Ÿ“ NLP์˜ ์ข…๋ฅ˜

    - Major conferences: ACL, EMNLP, NAACL

  ๐Ÿ”ฅ Low-level parsing: Tokenization, stemming

  ๐Ÿ”ฅ Word and Phrase level

    - NER (Named Entity Recognition), POS (Part-of-Speech) tagging, Noun-phrase chunking, Dependency parsing, Coreference resolution

  ๐Ÿ”ฅ Sentence level: Sentiment analysis, Machine translation

  ๐Ÿ”ฅ Multi-sentence and Paragraph level: Entailment prediction, Question answering, Dialog systems, Sumarization

 

  ๐Ÿ”ฅ Text Mining

    - ๊ธ€์ด๋‚˜ ๋ฌธ์„œ ๋ฐ์ดํ„ฐ์—์„œ ํ™œ์šฉ๊ฐ€๋Šฅํ•œ ์ •๋ณด๋‚˜ insight๋ฅผ ์ถ”์ถœํ•ด ๋‚ด๋Š” ๊ฒƒ.

    - Document clustering

 

  ๐Ÿ”ฅ Information retrieval: social science์™€ ๋†’์€ ์—ฐ๊ด€์ด ์žˆ์–ด ์ถ”์ฒœ ์„œ๋น„์Šค์— ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ๊ธฐ์ˆ 

 

  ๐Ÿ”ฅ NLP ํŠธ๋ Œ๋“œ

    - ๊ฐ ๋‹จ์–ด๋ฅผ ๋ฒกํ„ฐ๋กœ ๋‚˜ํƒ€๋‚ด ์ฒ˜๋ฆฌ -> RNN๊ณ„์—ด ๋ชจ๋ธ์„ ์‚ฌ์šฉ(LSTM, GRU) -> Attention module๊ณผ Transformer model ์‚ฌ์šฉ -> Self-Supervised Training์„ ํ™œ์šฉ(BERT, GPT-3 ๋“ฑ)

    - ๊ฒฐ๊ตญ ๋งŽ์€ ์ž๋ณธ๊ณผ ์ •๋ณด๋ฅผ ๊ฐ€์ง„ ์ „ ์„ธ๊ณ„์  ๊ธฐ์—…(Tesla, Google ๋“ฑ)์—์„œ ์—ฐ๊ตฌ๊ฐ€ ํ™œ๋ฐœํžˆ ์ง„ํ–‰

 

๐Ÿ“ Bag-of-Words

    - ๋ฌธ์žฅ ๋‚ด ๊ฐ ๋‹จ์–ด๋“ค์„ one-hot vector๋กœ ๊ณ ์นœ ๊ฐ’์œผ๋กœ ์ „๋ถ€ ๋”ํ•˜์—ฌ ๋‚˜ํƒ€๋‚ธ ๊ฒƒ.

    1. ์˜ˆ์‹œ ๋ฌธ์žฅ๋“ค์˜ ๊ฐ ๋‹จ์–ด๋ฅผ vocabulary๋ผ๋Š” ๊ณต๊ฐ„์— uniqueํ•˜๊ฒŒ ๋„ฃ์Œ

    2. uniqueํ•œ ๋‹จ์–ด๋“ค์„ one-hot vector๋กœ encoding ํ•จ

    3. ๋ฌธ์žฅ์„ one-hot vector๋“ค์˜ ํ•ฉ์œผ๋กœ ๋‚˜ํƒ€๋ƒ„.

 

  ๐Ÿ”ฅ Naive bayes classifier

    -  ์ŠคํŒธ ๋ฉ”์ผ ํ•„ํ„ฐ, ํ…์ŠคํŠธ ๋ถ„๋ฅ˜, ๊ฐ์ • ๋ถ„์„, ์ถ”์ฒœ ์‹œ์Šคํ…œ ๋“ฑ์— ๊ด‘๋ฒ”์œ„ํ•˜๊ฒŒ ํ™œ์šฉ๋˜๋Š” ๋ถ„๋ฅ˜ ๊ธฐ๋ฒ•

    - ํŠน์ • document๋ฅผ d๋ผํ•˜๊ณ  ์ „์ฒด class๋ฅผ c๋ผ ํ–ˆ์„ ๋•Œ, ์•„๋ž˜์™€ ๊ฐ™์€ ์‹์ด ๋‚˜์˜ด

    - ๋˜ํ•œ, P(d|c)๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

 

๐Ÿ“ Word Embedding

  ๐Ÿ”ฅ Embedding

    - ์ž์—ฐ์–ด๋ฅผ ์ •๋ณด์˜ ๊ธฐ๋ณธ ๋‹จ์œ„๋กœ ํ•ด(Sequence)๋ณผ ๋•Œ, ๊ฐ ๋‹จ์–ด๋“ค์„ ํŠน์ • ์ฐจ์›์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๊ณต๊ฐ„ ์ƒ์˜ ํ•œ ์ , ํ˜น์€ ์ ์˜ ์ขŒํ‘œ๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•ด ์ฃผ๋Š” ๊ธฐ๋ฒ•

    - ๋น„์Šทํ•œ ์˜๋ฏธ๋Š” ๊ฐ€๊นŒ์šด ๊ฑฐ๋ฆฌ์— ์ƒ์ถฉ๋˜๋Š” ์˜๋ฏธ๋Š” ๋ฉ€๋ฆฌ

 

  ๐Ÿ”ฅ Word2Vec

    - ์ฃผ์–ด์ง„ ์ž๋ฃŒ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ํŠน์ • ์–ธ์–ด์™€์˜ ๊ด€๊ณ„๋ฅผ ์ •์˜ํ•ด ํ•™์Šต

    - ์ฃผ์–ด์ง„ ๋ฌธ์žฅ์—์„œ ๊ฐ€์žฅ ์˜๋ฏธ๊ฐ€ ์ƒ์˜ํ•œ ๋‹จ์–ด๋ฅผ ์ฐพ์•„๋ƒ„(Word intrusion detction)

 

    - The Word2Vec algorithm splits each sentence into words and slides a window over them to gather (center word, context word) pairs, from which it learns the similarity and relatedness between words (a sketch of the window step follows below).
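
A minimal sketch of how the sliding window turns a sentence into (center word, context word) training pairs, assuming a window of one word on each side (the sentence and function name are illustrative):

```python
def skipgram_pairs(tokens, window=1):
    """Slide a window over the tokens and collect (center, context) word pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["I", "study", "math"], window=1))
# [('I', 'study'), ('study', 'I'), ('study', 'math'), ('math', 'study')]
```

These pairs are then fed to a shallow two-layer network, and the learned hidden-layer weights become the word vectors.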

 

  ๐Ÿ”ฅ GloVe(Global Vectors)

    - ๊ฐ ์ž…๋ ฅ, ์ถœ๋ ฅ ์Œ๋“ค์— ๋Œ€ํ•ด์„œ ํ•™์Šต ๋ฐ์ดํ„ฐ์—์„œ ๋‘ ๋‹จ์–ด๊ฐ€ ํ•œ ์œˆ๋„์šฐ ๋‚ด์—์„œ ์ด ๋ช‡ ๋ฒˆ ๋™์‹œ์— ๋“ฑ์žฅ ํ–ˆ๋Š”์ง€๋ฅผ ์‚ฌ์ „์— ๊ณ„์‚ฐํ•˜์—ฌ(Pij)์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•จ.

    - Word2Vec๋ณด๋‹ค ๋น ๋ฆ„