
Multi-modal recommendation algorithm fusing visual and textual features

  • Xuefeng Hu,

    Roles Validation, Visualization, Writing – original draft

    Affiliation The School of Electronics Engineering and Computer Science, Peking University, Beijing, China

  • Wenting Yu ,

    Roles Supervision, Writing – review & editing

    729974059@qq.com

    Affiliations The State Key Laboratory of Public Big Data, Guizhou University, Guiyang, China, The College of Computer Science and Technology, Guizhou University, Guiyang, China

  • Yun Wu,

    Roles Funding acquisition, Supervision, Writing – review & editing

    Affiliations The State Key Laboratory of Public Big Data, Guizhou University, Guiyang, China, The College of Computer Science and Technology, Guizhou University, Guiyang, China

  • Yukang Chen

    Roles Writing – review & editing

    Affiliations The State Key Laboratory of Public Big Data, Guizhou University, Guiyang, China, The College of Computer Science and Technology, Guizhou University, Guiyang, China

Abstract

In recommender systems, the lack of interaction data between users and items tends to cause data sparsity and cold-start problems. Recently, interest modeling frameworks that incorporate multi-modal features have been widely used in recommendation algorithms. These algorithms use image and text features to extend the available information, which effectively alleviates the data sparsity problem, but they still have limitations. On the one hand, the multi-modal features of user interaction sequences are not considered in the interest modeling process. On the other hand, the aggregation of multi-modal features often relies on simple aggregators, such as summation and concatenation, which do not distinguish the importance of different feature interactions. To tackle these issues, we propose the FVTF (Fusing Visual and Textual Features) algorithm. First, we design a user history visual preference extraction module based on Query-Key-Value attention to model users' historical interests from visual features. Second, we design a feature fusion and interaction module based on multi-head bit-wise attention to adaptively mine important feature combinations and update the higher-order attention fusion representation of features. We conduct experiments on the Movielens-1M dataset, and the results show that FVTF achieves the best performance compared with the benchmark recommendation algorithms.

Introduction

With the rapid development of Internet applications, service platforms such as e-commerce, advertising, and movie websites generate large amounts of data every day, creating an information overload problem that makes it difficult for users to choose among thousands of items and find the content they want. Recommender systems emerged to solve this problem: by analyzing users' historical behaviors and mining their preferences, they personalize the items recommended to each user. This brings great business value to companies and, at the same time, improves user satisfaction by making it easier for users to find the content they want.

Click-through rate (CTR) prediction [1] is a common task in recommender systems. It predicts users' click-through rates on candidate items by mining their interests from their historical behaviors, and ranks the candidate items so that items more likely to be of interest and clicked are recommended. Feature engineering is crucial to the accuracy of click-through prediction. For example, it is reasonable to recommend cosmetics to a woman, where <gender = female, product category = cosmetics> is a very useful second-order feature combination. Likewise, it is reasonable to recommend a game console to a 9-year-old boy, where <gender = male, age = 9, product category = game console> is an important third-order feature combination; such important feature combinations bring a great improvement to prediction performance. Traditional CTR prediction mainly relies on manually crafted features and uses shallow models such as Logistic Regression (LR) [2–4]. However, manual feature engineering relies heavily on domain experts and makes it difficult to find all useful feature combinations. In recent years, with the rise of deep learning, many deep learning-based CTR prediction algorithms [5–8] have overwhelmingly outperformed traditional algorithms. Such algorithms automate feature engineering, which not only replaces complex manual feature engineering but also taps into deeper feature information, i.e., higher-order feature combinations. Yet all such algorithms inevitably need to cope with data sparsity and cold-start problems: new users or new items have few records, so the corresponding parameters cannot be adequately trained. It is therefore crucial for recommendation algorithms to exploit the limited records to mine users' preferences and recommend items more accurately.

To address data sparsity and cold starts in recommender systems, some recent studies [9–11] have used multi-modal techniques to extend user and item features. Among these features, images are particularly important: they are widely present in item descriptions and play a decisive role in predicting whether a user will click on an item. For example, when shopping on e-commerce sites, users browse products through search or personalized recommendations; each product is usually presented with an image and some descriptive text, and when users are interested in a product they click on the image to see the details. Images provide intrinsic visual descriptions that are very intuitive and influence people's decisions. When shopping online, a user's eye is drawn to the product image before the textual description, and users typically decide whether they like a garment only after seeing an image of it. Since item images are objects that users interact with directly, they carry visual information about users' preferences, so incorporating image features into the recommendation algorithm to complement text features has a positive impact on the accuracy of the recommendation results.

In recent years, some researchers have added image features to their models, but these approaches generally have the following problems: (1) only the image features of the target items are considered in the model design, ignoring the influence of the image features of the user's historically clicked items on the user's visual preferences; (2) in feature fusion and interaction, the feature vectors are usually simply concatenated and fed into an MLP (Multilayer Perceptron) for implicit feature interaction, which ignores the importance of individual feature interactions and does not guarantee their effectiveness.

To address the above issues, we propose a multi-modal recommendation algorithm (FVTF) that incorporates visual and textual features. First, we note that users pay different amounts of attention to different visual features of target items. Traditional methods use simple operations such as pooling or concatenation to uniformly aggregate the representation vectors of image sequences, so the learned visual preference representations are not targeted to the current item. To explicitly mine users' visual preferences, we design a user visual preference extraction module based on the Query-Key-Value attention mechanism. The module learns different visual attention weights based on the visual features of the target item, adaptively obtaining the visual preference representation of the user. In addition, for feature fusion and interaction, we design a module based on a multi-head bit-wise attention mechanism to mine important feature combinations and extract higher-order attention fusion representations of visual and text features.

To summarize, we make several noteworthy contributions in this paper.

  • We design a user visual preference extraction module based on the Query-Key-Value attention mechanism. The module takes the image sequence of historical behaviors and the visual features of the target item as input, then adaptively learns the user's visual preference with respect to the target item.
  • We design a new feature fusion and interaction module, which mines important feature combinations using a multi-head bit-wise attention mechanism and updates the higher-order attention fusion representation of features according to the importance of the feature combinations, fully preserving and utilizing feature information from low to high order.
  • We conduct experiments on the Movielens-1M dataset. The results show that FVTF achieves the best performance compared with state-of-the-art algorithms.

Related work

Single-modal recommendation algorithms

Early CTR prediction algorithms relied on well-designed statistical features and shallow models such as LR, LightGBM, and XGBoost. Such algorithms are unable to learn complex feature crosses and require a great deal of manual feature engineering, which becomes extremely difficult as the number of feature dimensions grows. With the development of deep learning, building CTR prediction algorithms with deep neural networks has greatly reduced manual feature engineering and makes it possible to extract complex higher-order feature interactions and improve performance. Wide&Deep [5] and DeepFM [6] combine the higher-order interaction information extracted by deep neural networks with the first- and second-order interaction information extracted by LR and FM, giving the algorithms both memorization and generalization capabilities. Algorithms such as AFM [12] and AutoInt [13] introduce attention mechanisms to distinguish the importance of feature interactions and better focus on the important ones. To improve the attention mechanism and explicitly model higher-order feature interactions, thereby addressing the uncontrollability of neural-network interactions, HoAFM [14] and EHAFM [15] use bit-wise attention mechanisms and design explicit higher-order interaction schemes, which significantly improve performance and efficiency. However, the above algorithms only consider features of the text modality, which easily leads to data sparsity and cold-start problems when user and item behaviors are scarce, affecting accuracy.

Multi-modal recommendation algorithms

To address data sparsity and cold starts, image features carrying visual semantic information can complement textual information and bring better generalization capability to the algorithm. The emergence of deep learning networks facilitates the extraction of multi-modal features. For example, Chen et al. [16] proposed the ACF model based on an attention mechanism, and Wei et al. [17] proposed the GRCN model based on graph convolutional networks, both exploring how deep networks can learn from implicit feedback in the interaction between items and multi-modal features. Wei et al. [18] designed the Hierarchical User Intent Graph Network to learn multi-modal features from users' co-interaction patterns and capture multi-level user intent, so as to obtain high-quality representations of users and items and further enhance recommendation performance.

In recent years, research on image representation has achieved remarkable results, and learning high-level semantic features with deep learning algorithms [19, 20] is effective in a large number of tasks. Previous work has attempted to introduce image features into CTR prediction by extracting image representations with pre-trained CNNs and combining them with other features for click-through prediction. Chen et al. [21] proposed the DeepCTR algorithm, which trains CNNs in an end-to-end manner, fuses their output with text features, and feeds the result into an MLP for CTR prediction. Ge et al. [22] argue that images of the target item describe the visual characteristics of the advertisement, while images from the user's historical behavior reveal the user's visual preferences, and that combining this visual information yields better performance than using either source alone. The CMBF algorithm proposed by Chen et al. [23] uses a Multihead Self-Attention mechanism [24] to learn image features and text features separately, followed by cross-fusion of the features from both modalities.

However, when fusing multi-modal features, existing multi-modal recommendation algorithms usually concatenate the extracted features and treat them as a whole, ignoring the variability of user preferences across the interactions of different modalities. For example, MMGCN [25] and MGAT [26] learn single-modal user preferences and concatenate them to represent the multi-modal user preference for a micro-video. To tackle this problem, Wang et al. [27] design a multi-modal representation learning module that explicitly models the user's attention over different modalities and inductively learns the multi-modal user preference. Chen et al. [28] propose the Edge-wise mOdulation (EGO) fusion operation, which distills edge-wise multi-modal information and learns to modulate each unimodal node under the supervision of the other modalities; it breaks isolated single-modal propagation and allows information to flow between modalities.

Our proposed method

Overview

The FVTF algorithm consists of four parts: a pre-processing module, a user visual preference extraction module, a feature fusion and interaction module, and an output module, as shown in Fig 1.

First, in the feature pre-processing part, we use different methods to embed and encode the text and image features of users and items. The image embedding vectors from the user's historical behavior and the embedding vector of the target item's image are then fed into the user visual preference extraction module to obtain the user's visual preference feature vector. Afterwards, the user visual preference feature vector and the target image embedding vector are fed into the feature fusion and interaction module to fuse and interact with the user-side and item-side text vectors, respectively, producing the user portrait and the item portrait. Finally, in the output layer, the user portrait and the item portrait are concatenated and fed into a single-layer neural network, and the user's click-through rate is predicted through a sigmoid activation function.

Pre-processing module

To extract the information contained in text and image features, we design different preprocessing methods for each in the pre-processing module.

For text features, to alleviate the detrimental effects of the high dimensionality and sparsity of the feature vectors on training, we use embedding techniques [29–31] to reduce the dimensionality of numerical features, single-valued categorical features, and multi-valued categorical features: (1) where xi is an element of the text feature vector, ti is the reduced-dimensional embedding vector, wi is the embedding mapping vector when xi is a numerical feature, Wi is the embedding mapping matrix when xi is a categorical feature, and Q is the number of all potential values when xi is a multi-valued feature vector.

Using the embedding technique, we represent the set of text feature vectors tu of user u and the set of text feature vectors tv of item v as (2)

For image features, to extract their rich semantic information, we use an image feature extraction network based on EfficientNetB0 [32]. The set of feature vectors of the images historically clicked by user u can thus be expressed as: (3)

The image feature vector of the target item v can be represented as Iv.
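As an illustration of the pre-processing step, the sketch below embeds the three kinds of text features and extracts pooled EfficientNetB0 features from poster images with TensorFlow/Keras. The field names, vocabulary sizes, and the frozen-backbone choice are our own assumptions; the paper does not publish code.

```python
# A minimal sketch of the pre-processing module; field names and vocabulary
# sizes are hypothetical, not taken from the paper.
import tensorflow as tf

EMB_DIM_TEXT = 16    # text-feature embedding size reported in the experiments

# Single-valued categorical feature (e.g., occupation): look up one embedding row.
occupation_emb = tf.keras.layers.Embedding(input_dim=21, output_dim=EMB_DIM_TEXT)

# Numerical feature (e.g., normalized age): scale a learned vector w_i by x_i.
age_weight = tf.Variable(tf.random.normal([EMB_DIM_TEXT]))
def embed_numeric(x):                       # x: (batch,) float tensor
    return tf.expand_dims(x, -1) * age_weight

# Multi-valued categorical feature (e.g., movie genres): average the Q candidate embeddings.
genre_emb = tf.keras.layers.Embedding(input_dim=19, output_dim=EMB_DIM_TEXT)
def embed_multivalued(ids):                 # ids: (batch, max_genres) int tensor, 0-padded
    vecs = genre_emb(ids)                   # (batch, max_genres, 16)
    mask = tf.cast(ids > 0, tf.float32)[..., None]
    return tf.reduce_sum(vecs * mask, axis=1) / tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)

# Image features: frozen EfficientNetB0 pre-trained on ImageNet, pooled to one vector.
cnn = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg",
                                           input_shape=(64, 64, 3), weights="imagenet")
cnn.trainable = False
def extract_image_features(posters):        # posters: (batch, 64, 64, 3)
    return cnn(tf.keras.applications.efficientnet.preprocess_input(posters))
```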

User visual preference extraction module

The display image of an item often contains elements of the item's content, so a user's visual preferences are embedded in the images of the items the user has historically clicked. To model user visual preference features in a personalized way, we design a user visual preference extraction module, which uses the Query-Key-Value attention mechanism to mine the correlation between the images historically clicked by the user and the image of the target item, and uses this correlation to pool the image features of the user's historical click list and extract the user's visual preferences, as shown in Fig 2. Specifically, the correlation between the image features of the target item v and the image features of the i-th action of user u can be expressed as: (4) (5) where ψ(a, b) denotes the similarity between vectors a and b. It can be computed with neural networks, the vector inner product, or other methods; in this paper we use the vector inner product because of its simplicity and effectiveness. The query projection matrix and the key projection matrix map the original d-dimensional embedding space of the feature vectors into a new d′-dimensional space, where d′ must be consistent with the embedding dimension of the text features so that the text feature vectors and the image feature vectors can later be fused and interacted.

Next, we obtain the visual preference feature vector Iv,u of user u based on the correlations between the image features of the target item v and the image features of all actions of user u, as shown below: (6) where the value projection matrix projects the historical image features before they are weighted and summed.
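A minimal sketch of this module is given below, assuming that Eq (5) normalizes the inner-product similarities of Eq (4) with a softmax and that Eq (6) is the resulting weighted sum of value-projected history images; the layer and variable names (VisualPreferenceAttention, W_q, W_k, W_v) are our own notation.

```python
import tensorflow as tf

class VisualPreferenceAttention(tf.keras.layers.Layer):
    """Query-Key-Value attention over the user's historically clicked images (sketch)."""
    def __init__(self, d_prime):
        super().__init__()
        # Query/Key/Value projections from the raw image-feature space into R^{d'}.
        self.W_q = tf.keras.layers.Dense(d_prime, use_bias=False)
        self.W_k = tf.keras.layers.Dense(d_prime, use_bias=False)
        self.W_v = tf.keras.layers.Dense(d_prime, use_bias=False)

    def call(self, target_img, history_imgs):
        # target_img:   (batch, d)    image feature of the target item
        # history_imgs: (batch, T, d) image features of the T historically clicked items
        q = self.W_q(target_img)                      # (batch, d')
        k = self.W_k(history_imgs)                    # (batch, T, d')
        v = self.W_v(history_imgs)                    # (batch, T, d')
        # Inner-product similarity between the target image and each history image,
        # normalized over the history sequence (assumed softmax form of Eq. 5).
        scores = tf.einsum("bd,btd->bt", q, k)
        weights = tf.nn.softmax(scores, axis=-1)      # (batch, T)
        # Weighted sum of projected history images = user visual preference (Eq. 6).
        return tf.einsum("bt,btd->bd", weights, v)    # (batch, d')
```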

Feature fusion and interaction module

Currently, when fusing and interacting features of different modalities, most recommendation algorithms simply concatenate the feature vectors and feed them into an MLP for implicit feature interaction. This approach does not distinguish the importance of different feature combinations, and the effectiveness of the feature interactions cannot be guaranteed because of the black-box nature of neural networks. In this paper, we design a new feature fusion and interaction module consisting of multiple feature fusion and interaction layers. In each layer, a multi-head bit-wise attention mechanism learns the importance of each feature interaction and pools the interactions accordingly, increasing the impact of important feature interactions, reducing the interference of useless ones, and producing a higher-order fusion representation of the features. Previous work has verified the effectiveness of explicitly modeling feature interactions in this way, which performs better than vector-wise attention mechanisms such as Soft Attention [33] and multi-head Self-Attention [24].

The computation performed by a feature fusion and interaction layer is shown schematically in Fig 3. In multi-modal feature fusion, the new text feature representation contains information from the image features, and the new image feature representation contains information from the text features. By increasing the number of feature fusion and interaction layers, higher-order representations of the text and image features can be obtained, mining the valuable deep information in the features.

Fig 3. Schematic diagram of the feature fusion and interaction module.

Let the dimensionality of all feature vectors input to the module be d′. To enhance the expressiveness of the feature vectors, the model uses a trainable projection weight matrix to map each feature vector into a higher-dimensional space, expanding its dimensionality, as shown in the following equation: (7) where the terms denote, respectively, the l-th order feature vector of feature i, its first-order feature vector, and its l-th order enhanced representation; the l-th order projection matrix projects the d′-dimensional feature vector into D dimensions.

Then, the importance of feature interactions is mined using the bit-wise attention mechanism, the interactions are pooled according to the importance of each dimension, and the results are fused into a higher-order representation of the features. The l-th order representation of feature i can be formulated as: (8) where ⊙ denotes the Hadamard product of two vectors, (a1, b1, c1) ⊙ (a2, b2, c2) = (a1a2, b1b2, c1c2), which we use to represent the interaction of two feature vectors; the (l − 1)-th order representation of feature i interacts with feature j under the corresponding attention weight vector; and M is the number of feature vectors.

To avoid overfitting, we create multiple attention heads that do not share parameters, learn the importance of feature interactions in different representation subspaces, and finally integrate the outputs of the attention heads, which is formulated as follows: (9) (10) (11) where ⊕ is the concatenation operation, H is the total number of attention heads, and the weight parameters of each attention head are independent of one another.

To combine the information from the multiple attention heads, we apply a linear transformation to obtain a vector el of l-th order representations of all features: (12) where μ(·) is the activation function (LeakyReLU in this paper) and the projection vector transforms the representation vector from DH dimensions to D dimensions.

By stacking multiple feature fusion and interaction layers, the 1-st to l-th order representation tensor Z = [e1, e2, …, el] ∈ Rl×M×D of the features is obtained. Finally, the tensor Z is flattened to obtain the output p of the feature fusion and interaction module. For convenience, the feature fusion and interaction module is expressed functionally as follows: (13)
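The sketch below illustrates one feature fusion and interaction layer under these equations. It assumes the Eq (7) enhancement to D dimensions has already been applied to its inputs, and, since the text does not spell out how the bit-wise attention weights are computed, it uses a sigmoid-activated dense scorer per head as a stand-in; this is an assumption, not the authors' exact formulation.

```python
import tensorflow as tf

class FusionInteractionLayer(tf.keras.layers.Layer):
    """One fusion/interaction layer; inputs are assumed already projected to D dims."""
    def __init__(self, d_out, num_heads=2):
        super().__init__()
        self.D, self.H = d_out, num_heads
        # One bit-wise attention scorer per head (assumed form of the attention weights).
        self.scorers = [tf.keras.layers.Dense(d_out, activation="sigmoid")
                        for _ in range(num_heads)]
        # Eq. (12): map the concatenated head outputs (D*H dims) back to D dims.
        self.combine = tf.keras.layers.Dense(d_out)
        self.act = tf.keras.layers.LeakyReLU()

    def call(self, z_prev, z_first):
        # z_prev:  (batch, M, D) (l-1)-th order representations
        # z_first: (batch, M, D) first-order (enhanced) representations
        # Pairwise bit-wise interactions via the Hadamard product (Eq. 8): (batch, M, M, D)
        inter = z_prev[:, :, None, :] * z_first[:, None, :, :]
        head_outputs = []
        for scorer in self.scorers:
            attn = scorer(inter)                     # per-bit importance weights
            # Pool over the interacting features j according to per-bit importance.
            head_outputs.append(tf.reduce_sum(attn * inter, axis=2))   # (batch, M, D)
        # Eqs. (9)-(11): concatenate the heads; Eq. (12): project back to D with LeakyReLU.
        return self.act(self.combine(tf.concat(head_outputs, axis=-1)))
```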

The features input to the module are divided into user-side features and item-side features. First, all the user-side text feature vectors and the user visual preference feature vector form a user feature set eu, as shown below: (14)

Then, the user feature set is input to the feature fusion and interaction module to obtain the user portrait feature pu: (15)

For item-side features, the target item image feature vector is first mapped to an embedding representation space of the same dimension as the text feature vectors, as shown in the following equation: (16) where Wv ∈ Rd′×d, d is the text feature vector dimension, and d′ is the target item image feature vector dimension.

The set of item features ev is formed by combining all text feature vectors with the target item image feature vector, and is fed into the feature fusion and interaction module to obtain the item portrait feature pv, as follows: (17) (18)
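Continuing the earlier sketch, a hypothetical helper shows how the user-side or item-side first-order features could be enhanced, passed through stacked FusionInteractionLayer instances, and flattened into a portrait vector in the spirit of Eqs (13)-(18); the helper name and layer-stacking details are illustrative assumptions.

```python
import tensorflow as tf

def build_portrait(feats, num_layers=4, d_out=32, num_heads=2):
    """Stack fusion layers over one side's features and flatten them (sketch).

    feats: (batch, M, d') first-order feature vectors of one side (user or item),
    i.e. the text embeddings plus the (projected) image / visual-preference vector.
    Layers are created inside the function purely for illustration.
    """
    z_first = tf.keras.layers.Dense(d_out, use_bias=False)(feats)   # Eq. (7) enhancement
    z, orders = z_first, [z_first]
    for _ in range(num_layers):
        z = FusionInteractionLayer(d_out, num_heads)(z, z_first)    # next-order e^l
        orders.append(z)
    # Flatten the stacked low-to-high order tensor Z into the portrait vector p (Eq. 13).
    return tf.keras.layers.Flatten()(tf.concat(orders, axis=1))
```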

Output layer

In the output layer, we concatenate the user portrait feature pu with the item portrait feature pv and learn the relationship between pu and pv through a feed-forward neural network to predict the user's click-through rate on the target item, which is formulated as follows: (19) (20) (21) where ⊕ denotes the concatenation of two vectors; l denotes the number of layers of the feed-forward neural network; and ah, Wh, bh denote the output, weight matrix, and bias vector of the h-th layer, respectively.

Finally, the output of the last layer is passed through the Sigmoid activation function for CTR prediction.

Click-through rate prediction is a binary classification task (click vs. no click), so the algorithm uses the logarithmic loss; in addition, we add an L2 regularization term to prevent overfitting: (22) where N is the total number of training samples, y is the true label of a sample (1 or 0), ŷ is the predicted value, ϕ is the set of trainable parameters of the algorithm, and λ is the manually tuned L2 regularization factor.
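A small sketch of the output layer and objective follows, assuming a 384-unit hidden layer (as reported in the experimental setup), a ReLU hidden activation, and L2 regularization attached through Keras regularizers; these concrete choices beyond what the text states are assumptions.

```python
import tensorflow as tf

# Eq. (22)'s L2 term attached to the trainable weights via Keras regularizers.
l2 = tf.keras.regularizers.l2(1e-4)
output_mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(384, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.Dense(1, activation="sigmoid", kernel_regularizer=l2),
])

def predict_ctr(p_u, p_v):
    # Eq. (19): join the user and item portraits, then the feed-forward network
    # of Eqs. (20)-(21) with a sigmoid output for the predicted click-through rate.
    return output_mlp(tf.concat([p_u, p_v], axis=-1))

# Logarithmic (binary cross-entropy) loss; the L2 penalties are added by the regularizers.
loss_fn = tf.keras.losses.BinaryCrossentropy()
```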

Experiments

Dataset

This experiment uses the Movielens-1M [34] dataset, a set of movie rating data collected by GroupLens Research from MovieLens users between the late 1990s and the early 2000s, containing about 1 million rating records from about 6000 users on about 4000 movies. It includes users' ratings of movies, users' gender, age, and occupation, and each movie's genre and era. The dataset lacks movie image information but provides the imdbID of each movie. Based on the imdbID, we can look up the corresponding movie page on the IMDB website, which contains movie posters, and we use crawling techniques to obtain the poster of each movie.

For data preprocessing: for text features, null values are filled with -1. To construct the user history behavior sequence, the five click records immediately preceding the user's interaction with the target item (ordered by timestamp) are used, capturing the user's recent preferences. Since posters could not be obtained for some movies, the corresponding user behavior records are removed. We resize each poster to 64×64×3 and use an EfficientNetB0 model pre-trained on ImageNet (ILSVRC2012) as the image feature extraction module to extract the poster's image features. For the sample labels, samples with user ratings greater than 3 in the Movielens-1M dataset are regarded as positive samples (label = 1), those with ratings less than 3 as negative samples (label = 0), and the neutral samples with a rating of 3 are removed. The dataset is randomly shuffled and partitioned into training, validation, and test sets in an 8:1:1 ratio; the pre-processed dataset statistics are shown in Table 1.
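For concreteness, a hedged sketch of the label construction and 8:1:1 split is shown below; the file name, column names, and random seed are assumptions about the Movielens-1M ratings file rather than details taken from the paper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Movielens-1M ratings file (user::movie::rating::timestamp).
ratings = pd.read_csv("ratings.dat", sep="::", engine="python",
                      names=["user_id", "movie_id", "rating", "timestamp"])
ratings = ratings[ratings["rating"] != 3]                    # drop neutral samples
ratings["label"] = (ratings["rating"] > 3).astype(int)       # >3 positive, <3 negative

# Random shuffle and 8:1:1 split into train / validation / test.
train, rest = train_test_split(ratings, test_size=0.2, random_state=42, shuffle=True)
valid, test = train_test_split(rest, test_size=0.5, random_state=42, shuffle=True)
```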

Benchmark algorithms

We compare FVTF with the following algorithms, which can be divided into classical algorithms used before the emergence of deep networks, single-modal recommendation algorithms based on deep learning, and deep learning recommendation algorithms that incorporate multi-modal features.

(1) Classic algorithms before the advent of deep neural networks

  • LR [2]: logistic regression, an algorithm that uses only first-order features for prediction.
  • FM [4]: Factorization Machines, which represents second-order feature interactions using the dot product of two feature vectors.

(2) Single-modal recommendation algorithms based on deep learning

  • Wide&Deep [5]: combines logistic regression for extracting first-order features as the Wide part with a deep neural network for extracting higher-order features as the Deep part, giving the model both memorization and generalization capabilities.
  • DeepFM [6]: a model that combines a factorization machine and a deep neural network for extracting second-order interaction features and higher-order interaction features, respectively.
  • xDeepFM [7]: compresses the feature interaction tensor using convolutional neural networks and can explicitly model higher-order interactions at the vector-wise level.
  • AutoInt [13]: updates the feature interaction representation using a multi-head self-attentive mechanism.
  • HoAFM [14]: explicitly models feature higher-order interactions based on a bit-wise attention mechanism.
  • EHAFM [15]: expands the feature representation dimension based on HoAFM and adds multiple attention heads to enrich the feature representation information.

(3) Deep learning recommendation algorithms incorporating multi-modal features

  • MLFM [10]: a multi-modal post-fusion classification method based on text and images. It uses machine learning models to extract text and image features, learns a specific classifier for each modality, and then learns a fusion strategy from the outputs of the per-modality classifiers.
  • VBPR [11]: incorporates visual information into the recommendation model, a significant improvement over the matrix factorization model that relies only on the latent vectors of users and items.
  • DICM [22]: extracts user visual preference features from the item image features of the user's historical behavior and the image features of the target item, concatenates them with the user's ID features and the target item's image features, and feeds them into a deep neural network for click-through rate prediction.
  • CMBF [23]: proposes a cross-modal fusion method based on the Multihead Self-Attention mechanism that fully fuses multi-modal features to learn cross-information between different modalities.


Experimental setup

The algorithm is implemented in TensorFlow 2.0. In the pre-processing module, the embedding dimension of text features is set to 16 and the embedding dimension of image features to 128. In the feature fusion and interaction module, the number of feature interaction layers is set to 4, the number of attention heads to 2, and the feature fusion dimension to 32. In the output layer, a neural network with 384 neurons is used, and the L2 regularization factor is set to 0.0001. These parameters are the optimal values obtained from our experiments.

For training, Adam [35] is used as the optimizer with a batch size of 1024. The model is trained for 30 epochs with a learning rate of 0.01 and then for 10 more epochs with a lower learning rate of 0.001, and an early stopping strategy is used to prevent overfitting, as shown in Table 2.
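The schedule above might be reproduced in Keras roughly as follows; `model`, `train_ds`, and `valid_ds` are placeholders for the assembled FVTF network and the datasets batched at 1024, and the early-stopping patience is an assumed value since the paper does not report it.

```python
import tensorflow as tf

# model, train_ds, valid_ds are assumed to be defined elsewhere (placeholders).
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_auc", mode="max",
                                              patience=3, restore_best_weights=True)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.fit(train_ds, validation_data=valid_ds, epochs=30, callbacks=[early_stop])

# Continue training for 10 more epochs at the lower learning rate.
tf.keras.backend.set_value(model.optimizer.learning_rate, 0.001)
model.fit(train_ds, validation_data=valid_ds, epochs=10, callbacks=[early_stop])
```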

Evaluation metrics

The experiments in this paper use AUC and Logloss as evaluation metrics.

AUC (Area Under the Curve): AUC is the area under the ROC curve. It measures the probability that a positive sample is ranked ahead of a negative sample by the algorithm; the closer its value is to 1, the better the binary classification performance. The metric is insensitive to class imbalance and still gives reasonable assessments when the samples are imbalanced, making it the most important evaluation metric in click-through prediction tasks.

Logloss: Logloss is also a widely used evaluation criterion in binary classification tasks. It measures how well the predicted CTR fits the actual CTR; the lower the Logloss, the better the fit.
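Both metrics can be computed directly from predicted probabilities, for example with scikit-learn; the arrays below are toy values standing in for real test-set outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([1, 0, 1, 1, 0])            # ground-truth clicks
y_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.4])  # predicted click-through rates

print("AUC:    ", roc_auc_score(y_true, y_pred))
print("Logloss:", log_loss(y_true, y_pred))
```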

Experimental results and analysis

Performance comparison.

We compare the performance of FVTF with twelve benchmark algorithms. The results are shown in Table 3. We summarize the experimental analysis as follows.

Compared with traditional classical algorithms such as LR and FM, the deep learning-based algorithms achieve a large performance improvement, indicating that deep networks have a stronger ability to capture interaction features than manual feature engineering. Meanwhile, the multi-modal recommendation algorithms outperform the single-modal ones, indicating that introducing features of different modalities effectively extends the available information and extracts more interaction features, which contributes positively to algorithm performance.

Our proposed FVTF achieves the best performance. On the one hand, FVTF outperforms the classical algorithms and the single-modal deep learning recommendation algorithms because it uses multi-modal information to alleviate data sparsity, incorporates item image features to complement the text features, and captures user preferences from multiple modalities to provide more accurate recommendations. On the other hand, FVTF also improves over the multi-modal algorithms; in particular, it improves AUC by 3.3% over DICM. This confirms that the bit-wise attention-based adaptive feature fusion approach is more effective and flexible for obtaining interaction features than simple fusion (i.e., concatenation).

Training time comparison.

We compared the efficiency of FVTF with the other recommendation algorithms in terms of the training time per epoch; the results are shown in Fig 4. Compared with most single-modal algorithms, FVTF takes longer to train an epoch, but those algorithms ignore image features and have poorer prediction performance. Compared with the other deep learning recommendation algorithms that fuse multi-modal features, FVTF takes much less time per epoch while achieving the best performance. This is due to the low complexity of our feature fusion and interaction module and to the fact that separating user features from item features for feature interaction avoids unnecessary computation.

Ablation experiments.

We conducted ablation experiments on FVTF on Movielens-1M to verify the effectiveness of the feature fusion and interaction method designed in this paper, and analyzed the effects of the number of feature interaction layers, the feature fusion dimension, and the number of attention heads on model performance.

As shown in Table 4, compared with the Multihead Self-Attention and MLP methods, our feature fusion and interaction method achieves better performance, with a 1.4% improvement in AUC and a 4.5% reduction in Logloss relative to Multihead Self-Attention, which validates the effectiveness of the method. We can also see that model performance improves as the number of feature interaction layers increases up to 4, but decreases beyond 4. This may be because feature combinations above the 5th order carry little information useful for prediction and tend to lead to overfitting.

Table 4. Performance comparison of multimodal feature fusion approaches.

As shown in Fig 5, model performance improves as the feature fusion dimension increases and levels off when the dimension is ≥32. This is because a larger feature fusion dimension allows the higher-order representation of a feature to accommodate more information from the other features, making the fused representation more expressive.

Fig 5. Influence of feature fusion dimension on algorithm performance. (a) AUC. (b) Logloss.

As shown in Fig 6, we compare the performance of the model with different numbers of attention heads. With 2 attention heads there is a significant improvement over a single head, because multiple heads integrate information from multiple subspaces and improve the generalization ability of the model. However, when the number of attention heads is ≥3, AUC and Logloss level off and the model may even overfit, which affects the final performance.

Fig 6. Influence of the number of attention heads on algorithm performance. (a) AUC. (b) Logloss.

Conclusion

In this paper, we propose a multi-modal recommendation algorithm (FVTF) that fuses visual features with textual features. FVTF extracts valuable visual information from user history behavior sequences and target items as a complement to textual features, alleviating the data sparsity and cold-start problems in recommender systems. Our experiments on the Movielens-1M dataset show that FVTF achieves the best performance with high efficiency compared to the benchmark algorithms. In addition, the ablation results show that our fusion approach is much better than MLP and Multi-head Self-Attention.

In the future, there are two main directions: (1) since FVTF only utilizes information from two modalities (image and text), we will consider adding features from more modalities (e.g., audio, video); (2) since FVTF only considers the few historical click behaviors closest in time to the target item, we will consider modeling long- and short-term visual preferences to further improve performance.

References

  1. Wang X. A Survey of Online Advertising Click-Through Rate Prediction Models. In: 2020 IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA). vol. 1. IEEE; 2020. p. 516–521.
  2. Jiang Z, Gao S, Dai W. A CTR prediction approach for text advertising based on the SAE-LR deep neural network. Journal of Information Processing Systems. 2017;13(5):1052–1070.
  3. Richardson M, Dominowska E, Ragno R. Predicting clicks: estimating the click-through rate for new ads. In: Proceedings of the 16th International Conference on World Wide Web; 2007. p. 521–530.
  4. Rendle S. Factorization machines. In: 2010 IEEE International Conference on Data Mining. IEEE; 2010. p. 995–1000.
  5. Cheng HT, Koc L, Harmsen J, Shaked T, Chandra T, Aradhye H, et al. Wide & deep learning for recommender systems. In: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems; 2016. p. 7–10.
  6. Guo H, Tang R, Ye Y, Li Z, He X. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247. 2017.
  7. Lian J, Zhou X, Zhang F, Chen Z, Xie X, Sun G. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018. p. 1754–1763.
  8. Li Z, Cheng W, Chen Y, Chen H, Wang W. Interpretable click-through rate prediction through hierarchical attention. In: Proceedings of the 13th International Conference on Web Search and Data Mining; 2020. p. 313–321.
  9. Cai JJ, Tang J, Chen QG, Hu Y, Wang X, Huang SJ. Multi-View Active Learning for Video Recommendation. In: IJCAI; 2019. p. 2053–2059.
  10. Wu C, Wu F, An M, Huang J, Huang Y, Xie X. Neural news recommendation with attentive multi-view learning. arXiv preprint arXiv:1907.05576. 2019.
  11. He R, McAuley J. VBPR: visual Bayesian personalized ranking from implicit feedback. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 30; 2016.
  12. Xiao J, Ye H, He X, Zhang H, Wu F, Chua TS. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617. 2017.
  13. Song W, Shi C, Xiao Z, Duan Z, Xu Y, Zhang M, et al. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management; 2019. p. 1161–1170.
  14. Tao Z, Wang X, He X, Huang X, Chua TS. HoAFM: a high-order attentive factorization machine for CTR prediction. Information Processing & Management. 2020;57(6):102076.
  15. Chen Y, Long H, Wu Y, Jian L. Click-through rate prediction model of Enhanced High-order Attentive Factorization Machine. Computer Engineering and Applications. 2021; p. 1–10.
  16. Chen J, Zhang H, He X, Nie L, Liu W, Chua TS. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR'17. New York, NY, USA: Association for Computing Machinery; 2017. p. 335–344.
  17. Wei Y, Wang X, Nie L, He X, Chua TS. Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback. In: Proceedings of the 28th ACM International Conference on Multimedia. MM'20. New York, NY, USA: Association for Computing Machinery; 2020. p. 3541–3549.
  18. Wei Y, Wang X, He X, Nie L, Rui Y, Chua TS. Hierarchical User Intent Graph Network for Multimedia Recommendation. IEEE Transactions on Multimedia. 2022;24:2701–2712.
  19. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770–778.
  20. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012;25:1097–1105.
  21. Chen J, Sun B, Li H, Lu H, Hua XS. Deep CTR prediction in display advertising. In: Proceedings of the 24th ACM International Conference on Multimedia; 2016. p. 811–820.
  22. Ge T, Zhao L, Zhou G, Chen K, Liu S, Yi H, et al. Image matters: Visually modeling user behaviors using advanced model server. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management; 2018. p. 2087–2095.
  23. Chen X, Lu Y, Wang Y, Yang J. CMBF: Cross-Modal-Based Fusion Recommendation Algorithm. Sensors. 2021;21(16):5275. pmid:34450716
  24. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems; 2017. p. 5998–6008.
  25. Wei Y, Wang X, Nie L, He X, Hong R, Chua TS. MMGCN: Multi-Modal Graph Convolution Network for Personalized Recommendation of Micro-Video. In: Proceedings of the 27th ACM International Conference on Multimedia. MM'19. New York, NY, USA: Association for Computing Machinery; 2019. p. 1437–1445.
  26. Tao Z, Wei Y, Wang X, He X, Huang X, Chua TS. MGAT: Multimodal Graph Attention Network for Recommendation. Information Processing & Management. 2020;57(5):102277.
  27. Wang Q, Wei Y, Yin J, Wu J, Song X, Nie L. DualGNN: Dual Graph Neural Network for Multimedia Recommendation. IEEE Transactions on Multimedia. 2023;25:1074–1084.
  28. Chen F, Wang J, Wei Y, Zheng HT, Shao J. Breaking Isolation: Multimodal Graph Fusion for Multimedia Recommendation by Edge-Wise Modulation. MM'22. New York, NY, USA: Association for Computing Machinery; 2022. p. 385–394.
  29. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
  30. Barkan O, Koenigstein N. Item2vec: neural item embedding for collaborative filtering. In: 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE; 2016. p. 1–6.
  31. Grbovic M, Cheng H. Real-time personalization using embeddings for search ranking at Airbnb. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018. p. 311–320.
  32. Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. PMLR; 2019. p. 6105–6114.
  33. Xiao T, Xu Y, Yang K, Zhang J, Peng Y, Zhang Z. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 842–850.
  34. Harper FM, Konstan JA. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS). 2015;5(4):1–19.
  35. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.