Image Representations and New Domains in Neural Image Captioning. Jack Hessel, Nic Saava, Mike Wilber

Size: px

Start display at page:

Download "Image Representations and New Domains in Neural Image Captioning. Jack Hessel, Nic Saava, Mike Wilber"

Phyllis Barker
6 years ago
Views:

1 Image Representations and New Domains in Neural Image Captioning Jack Hessel, Nic Saava, Mike Wilber

2 1 The caption generation problem...? A young girl runs through a grassy field.

3 1? A you throug

4 1 Image Model? Language Model A you throug

5 1? Image Model? Language Model A you throug

6 2 The world of neural caption generation models... [Karpathy and Li 2014] [Mao et al. 2014] [Vinyals et al. 2014] [Kiros et al. 2014] [Donahue et al. 2014] [Fang et al. 2014]

7 3 [Vinyals et al. 2014]

8 4 Neural Image Caption (Vinyal s et. al. 2014)? A young girl runs through a grassy field.

9 4 Neural Image Caption (Vinyal s et. al. 2014) A young girl runs through a grassy field.

10 4 Neural Image Caption (Vinyal s et. al. 2014) A young girl runs through a grassy field.

11 4 Neural Image Caption (Vinyal s et. al. 2014) A young girl runs through a grassy field. Convolutional Neural Network

12 4 Neural Image Caption (Vinyal s et. al. 2014) Fixed Representation A young girl runs through a grassy field. Fixed Convolutional Neural Network + Pretrained

13 4 Neural Image Caption (Vinyal s et. al. 2014) Fixed Representation A young girl A young girl runs through a grassy field. * Fixed Convolutional Neural Network + Pretrained A young

14 4 Neural Image Caption (Vinyal s et. al. 2014) Fixed Representation A young girl A young girl runs through a grassy field. * Fixed Convolutional Neural Network + Pretrained A young Long Short Term Memory Recurrent Neural Network [Hochreiter and Schmidhuber 1997]

15 5 Sutskever et al The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show s agreement unanimously resurfaced. The wild pasteured with consistent street forests were incorporated by the 15th century BE...

16 5 Sutskever et al The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show s agreement unanimously resurfaced. The wild pasteured with consistent street forests were incorporated by the 15th century BE...

17 5 Sutskever et al The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show s agreement unanimously resurfaced. The wild pasteured with consistent street forests were incorporated by the 15th century BE...

18 6

19 6

20 6 Higher Quality Representations

21 6 Higher Quality Representations

22 6 Higher Quality Representations

23 6 Higher Quality Representations?? Higher Quality Captions??

24 7 Caption Quality Plausible outcomes... Higher Quality Representations

25 7 Caption Quality Plausible outcomes... Higher Quality Representations

26 7 Caption Quality Plausible outcomes... Higher Quality Representations

27 7 Caption Quality Plausible outcomes... Higher Quality Representations

28 8 Classification Accuracy Training Iterations

29 8 Classification Accuracy Training Iterations

30 8 Classification Accuracy Training Iterations

31 8 Less Blur Classification Accuracy Training Iterations

32 8 Less Blur Classification Accuracy Training Iterations

33 9 Directly learned: Flickr8k [Hodosh et al. 2013]

34 10

35 10

36 10

37 10

38 11 Malaysian Spicy Noodles 66K Image/recipe pairs, courtesy Yummly.com

39 12 Is this really a new language task?

40 12 Is this really a new language task?

41 12 Is this really a new language task? Less variety overall

42 12 Is this really a new language task? malaysian spicy noodles chicken curry with lime cheesy pizza bagels with mushrooms... Less variety overall

43 12 Is this really a new language task? malaysian spicy noodles chicken curry with lime cheesy pizza bagels with mushrooms... Less variety overall Shorter captions

44 13 Fixed Representation Fixed Convolutional Neural Network + Pretrained

45 13 =/=

46 14 Data from [Bossard et al. 2014] 101K Labeled Food Images in 101 Classes

47 Data from [Bossard et al. 2014] K Labeled Food Images in 101 Classes [Krizhevsky et al. 2012] 56.4% Rank-1 Accuracy

48 Data from [Bossard et al. 2014] K Labeled Food Images in 101 Classes Transfer Learning [Krizhevsky et al. 2012] 56.4% Rank-1 Accuracy

49 Data from [Bossard et al. 2014] K Labeled Food Images in 101 Classes Transfer Learning [Krizhevsky et al. 2012] 56.4% Rank-1 Accuracy 66.8% Rank-1 Accuracy

50 Data from [Bossard et al. 2014] K Labeled Food Images in 101 Classes Transfer Learning [Krizhevsky et al. 2012] 56.4% Rank-1 Accuracy [Krizhevsky et al. 2012] 66.8% Rank-1 Accuracy

51 Data from [Bossard et al. 2014] K Labeled Food Images in 101 Classes Transfer Learning [Krizhevsky et al. 2012] 56.4% Rank-1 Accuracy [Bossard et al. 2014] [Krizhevsky et al. 2012] 66.8% Rank-1 Accuracy

52 15 Transfer learned: Yummly

53 16 Caption Quality Plausible outcomes... Higher Quality Representations

54 16 Caption Quality Plausible outcomes... Higher Quality Representations

55 17 Implications? Image Model? Language Model

56 17 Implications Mostly Coarse-Grained Information Image Model? Language Model

57 17 Implications Mostly Coarse-Grained Information Image Model? Language Model [Fang et al. 2014]

58 18 Towards Fine-Grained Captioning

59 18 Towards Fine-Grained Captioning A young girl

60 18 Towards Fine-Grained Captioning A young girl

61 18 Towards Fine-Grained Captioning A young girl [Lin et al. 2014]

62 18 Towards Fine-Grained Captioning A young girl [Lin et al. 2014] Better Loss Functions!

63 19 A few more experiments in the paper...

64 19 A few more experiments in the paper... VGGNet [Simonyan and Zisserman 2014]

65 19 A few more experiments in the paper... VGGNet vs. [Simonyan and Zisserman 2014]

66 20 Acknowledgements Lillian Lee David Mimno Jason Yosinski Serge Belongie Xanda Schofield Gregory Druck Abby Lewis CS 6670 Students Serge David Jason Xanda Gregory

67 Image Representations and New Domains in Neural Image Captioning Jack Hessel, Nic Saava, Mike Wilber

68 New domain advice: one caption per image?

69 AlexNet vs VGG-Net

arxiv: v1 [stat.ml] 23 Jan 2017

arxiv: v1 [stat.ml] 23 Jan 2017 Learning what to look in chest X-rays with a recurrent visual attention model arxiv:1701.06452v1 [stat.ml] 23 Jan 2017 Petros-Pavlos Ypsilantis Department of Biomedical Engineering King s College London