登录
OAmaster

Imagine that you are working on a text classification model which detects patterns within strings. As an initial step, you should find the longest contiguous substring within a given string source, consisting of lowercase English letters. Your task is to find the longest contiguous substring of source consisting of the same character within source. If there are several substrings of the same length that meet this criterion, the rightmost one. Return a string consisting of this character concatenated with its number of occurrences in the longest contiguous substring (i.e., the length of the substring).

  • source = "bbbeed""b3"
  • source = "ccc""c3"

Constraints

1 ≤ source.length ≤ 100.

解法

双指针扫游程:对每段游程,若更长,或等长但起点严格更靠右(i > best_start),就更新最优。复杂度 O(n)

def solution(source: str) -> str:
    n = len(source)
    best_char, best_len, best_start = '', 0, -1
    i = 0
    while i < n:
        j = i
        while j < n and source[j] == source[i]: j += 1
        cur_len = j - i
        if cur_len > best_len or (cur_len == best_len and i > best_start):
            best_len, best_char, best_start = cur_len, source[i], i
        i = j
    return f"{best_char}{best_len}"
class Solution {
    static String solution(String source) {
        int n = source.length(), bestLen = 0, bestStart = -1;
        char bestChar = ' ';
        int i = 0;
        while (i < n) {
            int j = i;
            while (j < n && source.charAt(j) == source.charAt(i)) j++;
            int curLen = j - i;
            if (curLen > bestLen || (curLen == bestLen && i > bestStart)) {
                bestLen = curLen; bestChar = source.charAt(i); bestStart = i;
            }
            i = j;
        }
        return "" + bestChar + bestLen;
    }
}
#include <string>
using namespace std;

string solution(string source) {
 int n = source.size(), bestLen = 0, bestStart = -1;
 char bestChar = ' ';
 int i = 0;
 while (i < n) {
 int j = i;
 while (j < n && source[j] == source[i]) j++;
 int curLen = j - i;
 if (curLen > bestLen || (curLen == bestLen && i > bestStart)) {
 bestLen = curLen; bestChar = source[i]; bestStart = i;
 }
 i = j;
 }
 return string(1, bestChar) + to_string(bestLen);
}

Implement the missing code, denoted by ellipses. You may not modify the pre-existing code.

Your task is to implement parts of a Bagging algorithm from scratch (i.e., without importing any libraries or packages besides those already imported). As a reminder, Bagging classification comprises three major steps:

  1. Bootstrap the train sample for each base classifier.
  2. Train each base classifier based on its own bootstrapped samples.
  3. Assign class labels by majority vote of the base classifiers.

To validate the algorithm implementation, you will need to use it for some classification tasks. You will be given x_train (2D float array), y_train (1D int array of labels), and x_test.

Notes:

  • It is guaranteed that all training and test data will be float values.
  • In case of a tie in the voting, select the class label with the lower number.
  • For bootstrapping: n indices generated per classifier, where n is the number of training samples. Each index i generated by randint with 0 <= i < n. Indexes could repeat. seed should be used only in its initial place.

解法

Bootstrap = 有放回抽样(每个分类器 n 次);每棵 DecisionTreeClassifier 在自己的 bootstrap 样本上训练;用多数投票预测,平局取类别标号最小者。ML 题,只给 Python(依赖 sklearn)。

from sklearn.tree import DecisionTreeClassifier
from random import randint, seed

def bootstrap(X: list, y: list) -> tuple:
    n = len(X)
    idx = [randint(0, n - 1) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

def fit(classifiers, X, y):
    for clf in classifiers:
        bx, by = bootstrap(X, y)
        clf.fit(bx, by)

def predict(classifiers, X):
    out = []
    for sample in X:
        votes = {}
        for clf in classifiers:
            p = int(clf.predict([sample])[0])
            votes[p] = votes.get(p, 0) + 1
        best_count = max(votes.values())
        winners = [c for c, v in votes.items() if v == best_count]
        out.append(min(winners))
    return out

def solution(x_train, y_train, x_test, n_estimators):
    seed(0)
    classifiers = [DecisionTreeClassifier(random_state=0) for _ in range(n_estimators)]
    fit(classifiers, x_train, y_train)
    return predict(classifiers, x_test)

Implement the missing code, denoted by ellipses. You may not modify the pre-existing code.

Your task is to implement parts of a k-Means Clustering algorithm from scratch, given an initial set of centroids.

Calculate the distance between each data point and the k centroids. Assign each data point to its nearest centroid. Average the data points in each cluster to update the centroids' locations and repeat for a set number of iterations. Return the predicted cluster for each data point.

Note: If a particular cluster is not assigned data points in one iteration, do not include that cluster in the next iteration. Any different cluster enumeration is also considered a valid answer.

Constraints

1 ≤ |data| ≤ 10000, 2 ≤ k ≤ 20, 1 ≤ iterations ≤ 500.

Example: data=[[1,2],[1,4],[1,0],[10,2],[10,4],[10,0]], k=2, centroids=[[1,2],[10,2]], iterations=10[0,0,0,1,1,1].

解法

标准 Lloyd 算法:算两两距离(scipy.cdist),每个点分到最近质心,质心更新为簇均值。空簇在下一轮丢弃。ML 题,只给 Python(依赖 scipy)。

from scipy.spatial.distance import cdist
from typing import List

def calculate_distance(data: List[List[float]], centroids: List[List[float]]) -> List[List[float]]:
    return cdist(data, centroids, metric='euclidean').tolist()

def get_clusters(data: List[List[float]], centroids: List[List[float]]) -> List[int]:
    d = calculate_distance(data, centroids)
    return [row.index(min(row)) for row in d]

def update_clusters(clusters: List[int], data: List[List[float]], k: int) -> List[List[float]]:
    dim = len(data[0])
    new_centroids = []
    for c in range(k):
        members = [data[i] for i in range(len(data)) if clusters[i] == c]
        if not members:
            continue
        avg = [sum(m[j] for m in members) / len(members) for j in range(dim)]
        new_centroids.append(avg)
    return new_centroids

def fit_predict(data, k, centroids, iterations):
    cur = [list(c) for c in centroids]
    for _ in range(iterations):
        clusters = get_clusters(data, cur)
        cur = update_clusters(clusters, data, len(cur))
    return get_clusters(data, cur)

def solution(data, k, centroids, iterations):
    return fit_predict(data, k, centroids, iterations)
Pro 会员

解锁剩余 4 道 OA 真题

开通后立即解锁完整题面 + Python / Java / C++ 三语题解。 全站 165 家公司 · 1000+ 道 OA · 月度更新。